Predictive Analytics for Daily Guest Counts
Phase 1 Proposal¶
Domain Understanding¶
For the domain understanding part of this project, I drew on my experience working at Tortillas since the summer of 2022. In that time I have learned a great deal about how restaurants operate. Because Tortillas sits in a busy restaurant area, I am also familiar with what happens in the neighborhood, and I can easily talk to other restaurant workers to learn more. My boss at Tortillas also suggested factors that might affect how many guests we have, such as PSV matches or Effenaar concerts. This helped me decide what to look for when predicting guest numbers. For the full document, see the project proposal.
Data Sourcing¶
For data sourcing, I initially consulted with my boss at Tortillas to explore potential sources of data for my project. Unfortunately, the available data was limited. After two weeks of waiting, I received only the order deliveries data, which did not include sales data, essential for my intended analysis. Additionally, the order deliveries were aggregated on a weekly basis, making it challenging to make daily predictions accurately. The assumption that all purchases for the week corresponded to sales for that same week was not ideal, as sales could fluctuate daily. Furthermore, due to ethical and privacy concerns, accessing additional data, such as financial records, was not possible.
Despite these limitations, I realized that I could readily access data on the number of guests visiting Tortillas each day. While this dataset alone may seem insufficient, I recognized the potential to enrich it by incorporating external factors known to influence guest numbers. Factors such as weather conditions, local events, holidays, and special occasions could significantly impact restaurant traffic. With access to this data, I could enhance the complexity of my dataset and improve the accuracy of my predictive model.
Objectives and Target Audience¶
The primary objective of my project is to predict the number of guests visiting Tortillas restaurant each day. By doing so, I aim to assist restaurant management in making informed decisions regarding staffing levels, inventory management, and overall business strategy. The target audience for this project includes restaurant owners, managers, and staff who rely on accurate forecasts to optimize daily operations.
Data Characteristics¶
To achieve our objectives, I require structured data detailing the number of guests visiting Tortillas each day, preferably over a significant historical period to establish trends and patterns. Additionally, I will need unstructured data on external factors such as weather conditions, local events, and holidays. These datasets will provide the necessary context to enhance the predictive capabilities of our model.
Data Sources¶
The data on the number of guests visiting Tortillas each day can be sourced directly from the restaurant's records, provided by my boss. For external factors such as weather conditions and local events, I will need to search the internet or reach out to relevant organizations and authorities for accurate and up-to-date information. This may involve accessing public databases or APIs, or building my own dataset by collecting information from the calendar.
Version Control¶
To maintain version control for my data, I plan to start with a dataset spanning three months of historical guest counts and external factors. As I clean, preprocess, and augment the data, I will maintain a clear version history to ensure reproducibility and scalability. If the initial model proves successful, I will consider expanding the dataset and implementing more advanced version control practices to manage ongoing updates and improvements.
Analytic Approach¶
In this phase, a specific goal and approach for the project are defined, focusing on predicting the number of guests visiting the restaurant. The objective is to determine the target variable, understand the nature of the problem, and identify key indicators within the dataset to facilitate accurate predictions.
🎯 Target Variable:¶
The target variable for the predictive model is the number of guests visiting the restaurant on a given day. This variable represents the "thing" aimed to be predicted based on other features in the dataset. By forecasting the daily guest count, assistance can be provided to restaurant management in optimizing staffing levels, inventory management, and overall business strategy.
🎨 Nature of the Problem:¶
The nature of the target variable determines the type of algorithm suitable for the predictive analytics model. In this case, it involves a regression problem as the aim is to predict a continuous numerical value – the daily guest count. Regression algorithms are well-suited for such problems as they can effectively estimate the relationship between input features and the target variable to make predictions.
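As a minimal sketch of what a regression setup looks like (toy numbers, not the real dataset; the actual models appear later in this notebook), a single numeric feature already illustrates the idea:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: one feature (e.g. temperature) against guest counts
X = np.array([[5], [10], [15], [20], [25]])
y = np.array([20, 35, 50, 65, 80])

# Fit a regression model; the prediction is a continuous value, not a class label
model = LinearRegression().fit(X, y)
print(model.predict([[18]]))
```

This is what distinguishes the problem from classification: the model outputs any value on a continuous scale rather than choosing from fixed categories.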
📍 Good Indicators:¶
To identify good indicators for predicting the daily guest count, exploratory data analysis (EDA) is conducted to uncover correlations between the target variable and other features in the dataset. By calculating correlation coefficients, the strength and direction of relationships between variables can be determined. Features exhibiting strong correlations with the number of guests serve as valuable indicators for the predictive model.
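To make the correlation check concrete, it can be sketched as follows (toy values for illustration only; the column names mirror the ones used later in this notebook):

```python
import pandas as pd

# Hypothetical values, not the real Tortillas data
df = pd.DataFrame({
    'Total guests': [20, 35, 50, 65, 80, 95],
    'Temperature':  [5, 10, 14, 18, 22, 26],
    'Rain':         [8.0, 5.5, 4.0, 2.0, 1.0, 0.0],
})

# Pearson correlation of each candidate feature with the target variable
corr = df.corr()['Total guests'].drop('Total guests')
print(corr.sort_values(ascending=False))
```

Features with correlations close to +1 or -1 are promising indicators; those near 0 add little predictive value on their own.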
Phase 2 Provisioning¶
Data Requirements¶
For the Tortillas project, I use several datasets to add more features. These datasets include:
Guest Counts: A dataset containing daily guest counts for Tortillas serves as the primary input. This data reveals the restaurant's foot traffic over time, essential for prediction.
Weather data: Weather data, especially temperature and precipitation, are important for the restaurant industry. These factors strongly influence consumer behavior. Although temperature and precipitation are expected to have the largest effect, other climate variables will also be considered to examine relationships and improve forecasts.
Effenaar Agenda: An Effenaar agenda dataset lists upcoming events and concerts, including artist, venue, and date/time details. This dataset allows consideration of local events' impact on restaurant attendance.
PSV Matches: Information about PSV matches, indicating home or away games and match types (e.g., Eredivisie, Champions League), is included. Sporting events draw crowds to the area, affecting restaurant traffic.
Other: Consideration of additional factors affecting restaurant attendance, such as holidays, local festivals, promotions, or community events, is imperative. While these data may not be readily accessible, it is essential to explore avenues for integrating relevant information into the analysis.
Data dictionary¶
I use multiple datasets for my project, so let me start with the first one, "Tortillas", and work my way through the rest.
| Data Element Name | Data Type | Units | Range | Description | Source | Quality Standards | Notes |
|---|---|---|---|---|---|---|---|
| Date | Datetime | YYYY-MM-DD | January 2022 - now | The date of observation | Recorded data | There shouldn't be missing dates, double check | Generated with Excel |
| Day | Object (categorical) | | Monday - Sunday | The day of the week corresponding to the date | Extracted from calendar | "" | "" |
| Season | "" | | Summer - Winter | The season corresponding to the date | Internet | It should match the calendar seasons | Looked on the internet for correct dates |
| Total guests | Integer | Amount | 0 - 200 | The total number of guests recorded on the given date | Recorded data | Accurate counting of guests, double check | Pictures taken, entered manually in Excel sheet |
This is for the "weather" dataset
| Data Element Name | Data Type | Units | Range | Description | Source | Quality Standards | Notes |
|---|---|---|---|---|---|---|---|
| Date | Datetime | YYYY-MM-DD | January 2022 - now | The date of observation | KNMI | There shouldn't be missing dates, consistent | Copy/paste |
| Temperature | Float | Celsius (°C) | -10 to 35 | The maximum temperature recorded on the given date | "" | Very accurate | The highest temperatures are measured around 17:00-18:00 (the time the restaurant opens) |
| Rain | "" | Millimeters | 0 - 50 | The amount of rainfall recorded on the given date | "" | "" | |
| Duration rain | "" | Hours | 0 - 24 | The duration of rainfall recorded on the given date | "" | "" | |
| Max rain | "" | Millimeters per hour | 0 - 50 | The maximum intensity of rainfall recorded on the given date | "" | "" | |
| Timezone max rain | Integer | Hours | 0 - 23 | The time of day (in hours) when the maximum rainfall intensity occurred | "" | "" | |
| Wind | Float | Meters per second | 0 - 15 | The wind speed recorded on the given date | "" | "" | |
| Timezone max wind | Integer | Hours | 0 - 23 | The time of day (in hours) when the maximum wind speed occurred | "" | "" |
This is for the "Effenaar" dataset.
| Data Element Name | Data Type | Units | Range | Description | Source | Quality Standards | Notes |
|---|---|---|---|---|---|---|---|
| Name | String | | | The name of the concert/event | Effenaar agenda | Not important | Manually copy/pasted into an Excel sheet; couldn't import it |
| Date | Datetime | YYYY-MM-DD | January 2022 - now | The date of the concert/event | "" | Double check input | "" |
| Time | Datetime | HH:MM | 12:00 - 23:59 | The time of day when the concert/event starts | "" | "" | "" |
| Location | Object (categorical) | | | The location/venue of the concert/event | "" | Contains 2 values: "Grote zaal" and "Kleine zaal" | "" |
| Sold out | Boolean | | True/False | Indicates whether the concert/event is sold out | "" | No missing values | "" |
This is for the "PSV" dataset.
| Data Element Name | Data Type | Units | Range | Description | Source | Quality Standards | Notes |
|---|---|---|---|---|---|---|---|
| Date | Datetime | YYYY-MM-DD | January 2022 - now | The date of the PSV event | PSV fansite | Double check input | Manually copy/pasted into a csv; couldn't import it |
| Time | Datetime | HH:MM | 00:00 - 23:59 | The time of day when the PSV event starts | "" | "" | "" |
| Type | Object (categorical) | | | The type/category of the PSV event | "" | Contains values such as friendly match, league match, tournament | "" |
| Canceled | Boolean | True/False | Indicates whether the PSV event is canceled | "" | No missing values | "" |
This is for the Holiday and Feastday dataset.
| Data Element Name | Data Type | Units | Range | Description | Source | Quality Standards | Notes |
|---|---|---|---|---|---|---|---|
| Date | Datetime | YYYY-MM-DD | January 2022 - now | The date of the holiday/feastday | Recorded data | Double check input | |
| Holiday | Object (categorical) | | | The name of the holiday | Rijksoverheid open data | Very accurate | API integration |
| Feastday | Object (categorical) | | | The name of the feastday | Internet (search dates) | Double check input | Discussed with stakeholder which feastdays could have an influence; custom ones can be added |
The Netherlands (soccer team) dataset.
| Data Element Name | Data Type | Units | Range | Description | Source | Quality Standards | Notes |
|---|---|---|---|---|---|---|---|
| Date | Datetime | YYYY-MM-DD | January 2022 - now | The date of the soccer match | The official OnsOranje site | Double check input | Manually copy/pasted into a csv; couldn't import it |
| Time | Datetime | HH:MM | 00:00 - 23:59 | The time of the soccer match | "" | "" | "" |
| Hometeam | Boolean | | True/False | Indicates whether the Netherlands team is the home team | "" | No missing values | "" |
Data Collection¶
# Import libraries
import math
import warnings

import matplotlib.dates as mdates
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import max_error, mean_squared_error
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     learning_curve, train_test_split)
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Disable warnings
warnings.simplefilter(action='ignore')
Restaurant¶
The data was gathered directly from the restaurant using a system called Formitable. This platform allows me to effortlessly track the number of guests we've served each day. I then manually input this information into my Tortillas.xlsx file for record-keeping.
# Read the data
dfTortillas = pd.read_excel("Tortillas.xlsx")
dfTortillas.head()
| Date | Day | Season | Total guests | |
|---|---|---|---|---|
| 0 | 2024-03-17 | Sunday | Winter | 50 |
| 1 | 2024-03-16 | Saturday | Winter | 71 |
| 2 | 2024-03-15 | Friday | Winter | 53 |
| 3 | 2024-03-14 | Thursday | Winter | 55 |
| 4 | 2024-03-13 | Wednesday | Winter | 11 |
Date: Represents the specific date, formatted as YYYY-MM-DD.
Day: Indicates the day of the week corresponding to the date.
Season: The season corresponding to the date.
Total guests: Refers to the total number of guests served on the corresponding date.
dfTortillas.describe()
| Date | Total guests | |
|---|---|---|
| count | 807 | 807.000000 |
| mean | 2023-02-08 00:00:00 | 40.662949 |
| min | 2022-01-01 00:00:00 | 0.000000 |
| 25% | 2022-07-21 12:00:00 | 23.000000 |
| 50% | 2023-02-08 00:00:00 | 35.000000 |
| 75% | 2023-08-28 12:00:00 | 55.500000 |
| max | 2024-03-17 00:00:00 | 166.000000 |
| std | NaN | 25.184045 |
dfTortillas.dtypes
Date            datetime64[ns]
Day                     object
Season                  object
Total guests             int64
dtype: object
dfTortillas.shape
(807, 4)
Weather¶
The weather data was collected from the Royal Netherlands Meteorological Institute (KNMI). KNMI provides historical weather data, including temperature, precipitation, and wind speed, which are useful factors for this analysis. The data was downloaded in a structured format, which allowed for easy integration into the existing dataset. This comprehensive weather data provides valuable insight into how weather conditions may influence the number of guests visiting the restaurant.
# Read the data
dfWeather = pd.read_csv("Weather.csv", sep=",")
dfWeather.head()
| STN | YYYYMMDD | DDVEC | FHVEC | FG | FHX | FHXH | FHN | FHNH | FXX | ... | VVNH | VVX | VVXH | NG | UG | UX | UXH | UN | UNH | EV24 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 370 | 20220101 | 192 | 41 | 42 | 50 | 1 | 30 | 9 | 90 | ... | 2 | 75 | 13 | 8 | 84 | 91 | 2 | 78 | 23 | 4 |
| 1 | 370 | 20220102 | 222 | 60 | 66 | 100 | 21 | 40 | 2 | 170 | ... | 8 | 75 | 9 | 8 | 81 | 93 | 19 | 73 | 14 | 3 |
| 2 | 370 | 20220103 | 224 | 67 | 68 | 80 | 1 | 40 | 18 | 140 | ... | 5 | 75 | 14 | 8 | 84 | 90 | 7 | 77 | 1 | 2 |
| 3 | 370 | 20220104 | 236 | 28 | 37 | 60 | 12 | 20 | 16 | 100 | ... | 12 | 70 | 14 | 8 | 91 | 95 | 7 | 84 | 21 | 1 |
| 4 | 370 | 20220105 | 254 | 52 | 53 | 80 | 11 | 30 | 20 | 150 | ... | 16 | 73 | 4 | 6 | 85 | 95 | 21 | 74 | 11 | 2 |
5 rows × 41 columns
The weather dataset contains a wealth of information, including various meteorological factors. However, for the purpose of this analysis, we will focus on the features that are most likely to influence restaurant attendance. Specifically, we will consider the temperature, amount of rainfall, and wind speed. These factors are believed to have a significant impact on consumer behavior and, consequently, the number of guests visiting the restaurant.
# Select the relevant KNMI columns (note the leading spaces in the raw column names)
dfWeather_selected = dfWeather[['YYYYMMDD', ' TX', ' RH', ' DR', ' RHX', ' RHXH', ' FG', ' FHXH']].copy()
# Renaming columns for clarity
dfWeather_selected.columns = ['Date', 'Temperature', 'Rain', 'Duration rain', 'Max rain', 'Timezone max rain', 'Wind', 'Timezone max wind']
# Parse the date column
dfWeather_selected['Date'] = pd.to_datetime(dfWeather_selected['Date'], format='%Y%m%d')
# KNMI reports these values in tenths (0.1 °C, 0.1 mm, 0.1 h, 0.1 m/s); convert to base units
for col in ['Temperature', 'Rain', 'Duration rain', 'Max rain', 'Wind']:
    dfWeather_selected[col] *= 0.1
dfWeather_selected.head()
| Date | Temperature | Rain | Duration rain | Max rain | Timezone max rain | Wind | Timezone max wind | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-01 | 14.4 | 0.0 | 0.0 | 0.0 | 1 | 4.2 | 1 |
| 1 | 2022-01-02 | 13.7 | 2.7 | 4.8 | 0.6 | 19 | 6.6 | 21 |
| 2 | 2022-01-03 | 11.1 | 0.5 | 0.4 | 0.4 | 7 | 6.8 | 1 |
| 3 | 2022-01-04 | 8.5 | 3.9 | 3.9 | 1.7 | 15 | 3.7 | 12 |
| 4 | 2022-01-05 | 6.2 | 3.3 | 2.8 | 2.0 | 19 | 5.3 | 11 |
Date = Date (YYYY=year, MM=month, DD=day)
Temperature = Maximum temperature (degrees Celsius). The restaurant opens at 5 p.m. on weekdays and around 2 p.m. on weekends, so it is sensible to use the maximum temperature recorded for each day.
Rain = Daily precipitation amount (in mm) (-0.1 for <0.05 mm)
Duration rain = Precipitation duration (in hours)
Max rain = Maximum hourly precipitation amount (in mm) (-0.1 for <0.05 mm)
Timezone max rain = Hourly division in which Max rain was measured
Wind = Daily mean wind speed (in m/s)
Timezone max wind = Hourly division in which the maximum wind speed was measured
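Note that KNMI encodes "less than 0.05 mm" as -0.1 after conversion. Before modeling, these sentinel values can be clipped to 0 so the rain columns become true non-negative amounts. A small sketch with toy values:

```python
import pandas as pd

# Toy frame containing the KNMI sentinel -0.1 ("<0.05 mm")
dfW = pd.DataFrame({'Rain': [-0.1, 0.0, 2.7], 'Max rain': [-0.1, 0.4, 1.7]})

# Clip the sentinel to 0 so both columns are non-negative precipitation amounts
dfW[['Rain', 'Max rain']] = dfW[['Rain', 'Max rain']].clip(lower=0)
print(dfW)
```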
dfWeather_selected.describe()
| Date | Temperature | Rain | Duration rain | Max rain | Timezone max rain | Wind | Timezone max wind | |
|---|---|---|---|---|---|---|---|---|
| count | 805 | 805.000000 | 805.000000 | 805.000000 | 805.000000 | 805.000000 | 805.000000 | 805.000000 |
| mean | 2023-02-07 00:00:00 | 15.979752 | 2.548447 | 1.965093 | 0.948199 | 6.744099 | 3.805839 | 10.961491 |
| min | 2022-01-01 00:00:00 | -1.300000 | -0.100000 | 0.000000 | -0.100000 | 1.000000 | 0.800000 | 1.000000 |
| 25% | 2022-07-21 00:00:00 | 10.400000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 2.600000 | 8.000000 |
| 50% | 2023-02-07 00:00:00 | 15.400000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 3.500000 | 11.000000 |
| 75% | 2023-08-27 00:00:00 | 22.100000 | 2.700000 | 2.600000 | 1.100000 | 13.000000 | 4.800000 | 14.000000 |
| max | 2024-03-15 00:00:00 | 38.300000 | 48.300000 | 20.800000 | 24.900000 | 24.000000 | 10.200000 | 24.000000 |
| std | NaN | 7.719364 | 5.443478 | 3.465313 | 2.076207 | 7.725651 | 1.765635 | 5.700856 |
dfWeather_selected.dtypes
Date                 datetime64[ns]
Temperature                 float64
Rain                        float64
Duration rain               float64
Max rain                    float64
Timezone max rain             int64
Wind                        float64
Timezone max wind             int64
dtype: object
dfWeather_selected.shape
(805, 8)
PSV¶
dfPsv = pd.read_csv("Psv.csv", sep=",", encoding="ISO-8859-1")
dfPsv.head()
| Date | Time | Type | Cancelled | |
|---|---|---|---|---|
| 0 | 2022-05-11 | 20:00 | Eredivisie | 0 |
| 1 | 2022-05-01 | 14:30 | Eredivisie | 0 |
| 2 | 2022-04-17 | 18:00 | KNVB Beker | 0 |
| 3 | 2022-04-14 | 18:45 | Europa Conference League | 0 |
| 4 | 2022-04-10 | 14:30 | Eredivisie | 0 |
Date: Represents the date on which the match was played, formatted as YYYY-MM-DD.
Time: Indicates the time at which the match started, typically in HH:MM format using the 24-hour clock.
Type: Specifies the type of match that took place, such as Eredivisie, KNVB Beker, or Champions League.
Cancelled: Denotes whether the match was cancelled or not, with 0 indicating that the match was not cancelled and 1 indicating that it was cancelled.
dfPsv.describe()
| Cancelled | |
|---|---|
| count | 66.000000 |
| mean | 0.015152 |
| std | 0.123091 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 0.000000 |
| max | 1.000000 |
dfPsv.dtypes
Date         object
Time         object
Type         object
Cancelled     int64
dtype: object
dfPsv.shape
(66, 4)
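The dtypes above show that Date and Time were read as plain objects. Before merging with the datetime-indexed guest data, the Date column can be parsed into a proper datetime. A sketch with toy rows mirroring the csv structure:

```python
import pandas as pd

# Toy rows standing in for the real Psv.csv contents
dfPsv_demo = pd.DataFrame({'Date': ['2022-05-11', '2022-05-01'],
                           'Time': ['20:00', '14:30'],
                           'Cancelled': [0, 0]})

# Parse Date so it matches the datetime64 columns in the other datasets
dfPsv_demo['Date'] = pd.to_datetime(dfPsv_demo['Date'], format='%Y-%m-%d')
print(dfPsv_demo.dtypes)
```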
Effenaar¶
The Effenaar data was manually collected from the official site's calendar.
dfEffenaar = pd.read_excel("Effenaar.xlsx")
dfEffenaar.head()
| Name | Date | Time | Location | Sold out | |
|---|---|---|---|---|---|
| 0 | TechGlow | 2024-03-15 | 23:00:00 | Grote zaal | 0 |
| 1 | Pene Corrida | 2024-03-15 | 20:00:00 | Kleine zaal | 0 |
| 2 | Rowwen Hèze | 2024-03-14 | 20:15:00 | Grote zaal | 1 |
| 3 | Hypnotic Brass Ensemble | 2024-03-10 | 20:15:00 | Kleine zaal | 0 |
| 4 | Nacht van de Nederpop | 2024-03-10 | 20:15:00 | Grote zaal | 0 |
Name: Represents the name or title of the event.
Date: Indicates the date of the event, formatted as YYYY-MM-DD.
Time: Specifies the time at which the event starts, typically in HH:MM:SS format using the 24-hour clock.
Location: Refers to the venue or location where the event is held.
Sold out: Denotes whether the event is sold out or not, with 0 indicating that it is not sold out and 1 indicating that it is sold out.
dfEffenaar.describe()
| Date | Sold out | |
|---|---|---|
| count | 321 | 321.000000 |
| mean | 2023-05-26 13:36:26.915887872 | 0.261682 |
| min | 2022-09-01 00:00:00 | 0.000000 |
| 25% | 2022-12-17 00:00:00 | 0.000000 |
| 50% | 2023-05-07 00:00:00 | 0.000000 |
| 75% | 2023-11-09 00:00:00 | 1.000000 |
| max | 2024-12-29 00:00:00 | 1.000000 |
| std | NaN | 0.440237 |
dfEffenaar.dtypes
Name                object
Date        datetime64[ns]
Time                object
Location            object
Sold out             int64
dtype: object
dfEffenaar.shape
(321, 5)
Holidays¶
I discussed with my stakeholder whether he had noticed any difference in the number of customers on holidays compared to "normal" days. I then searched for the relevant dates and collected them manually.
# Read the feast days (sheet 0) and holidays (sheet 1) from the Excel file
dfFeastDays = pd.read_excel("Holiday.xlsx", sheet_name=0)
dfHolidays = pd.read_excel("Holiday.xlsx", sheet_name=1)
dfFeastDays.head()
dfHolidays.head()
| Date | Holiday | |
|---|---|---|
| 0 | 2022-10-22 | Herfstvakantie |
| 1 | 2022-10-23 | Herfstvakantie |
| 2 | 2022-10-24 | Herfstvakantie |
| 3 | 2022-10-25 | Herfstvakantie |
| 4 | 2022-10-26 | Herfstvakantie |
Date: Indicates the date of the event, formatted as YYYY-MM-DD.
Feastday: The name of the feastday for that day.
Holiday: The name of the holiday for that day.
dfFeastDays.describe()
| Date | |
|---|---|
| count | 112 |
| mean | 2023-07-26 15:00:00 |
| min | 2022-02-14 00:00:00 |
| 25% | 2022-10-29 18:00:00 |
| 50% | 2023-08-12 12:00:00 |
| 75% | 2024-04-07 12:00:00 |
| max | 2024-12-05 00:00:00 |
dfFeastDays.dtypes
Date        datetime64[ns]
Feastday            object
dtype: object
dfFeastDays.shape
(112, 2)
dfHolidays.describe()
| Date | |
|---|---|
| count | 175 |
| mean | 2023-10-30 18:47:18.857142784 |
| min | 2022-10-22 00:00:00 |
| 25% | 2023-07-15 12:00:00 |
| 50% | 2023-10-14 00:00:00 |
| 75% | 2024-06-20 12:00:00 |
| max | 2024-08-18 00:00:00 |
dfHolidays.dtypes
Date       datetime64[ns]
Holiday            object
dtype: object
dfHolidays.shape
(175, 2)
Netherlands (soccer)¶
Not only PSV matches influence the number of guests; according to my stakeholder, matches of the Dutch national team do as well. You can find it here
dfDutch = pd.read_csv('Netherlands.csv', sep=',', encoding="ISO-8859-1")
dfDutch.head()
| Date | Type | Hometeam | |
|---|---|---|---|
| 0 | 2022-03-26 | OEFENINTERLAND | True |
| 1 | 2022-03-29 | OEFENINTERLAND | True |
| 2 | 2022-06-03 | UEFA NATIONS LEAGUE | False |
| 3 | 2022-06-08 | UEFA NATIONS LEAGUE | False |
| 4 | 2022-06-11 | UEFA NATIONS LEAGUE | True |
Date: Indicates the date of the match, formatted as YYYY-MM-DD.
Type: The type of match.
Hometeam: Indicates whether the Netherlands team played at home.
dfDutch.dtypes
Date          object
Type          object
Hometeam        bool
dtype: object
dfDutch.shape
(25, 3)
Version control¶
Version control is important, and GitHub provides an excellent platform for it. The initial version of the dataset contains records with dates starting from the beginning of 2024. After approval, the dataset will be expanded and more observations added.
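A minimal sketch of what this workflow could look like with git (the repository and file names are assumptions for illustration):

```shell
# Start a repository for the dataset (hypothetical names)
git init tortillas-data
cd tortillas-data
git config user.name "Demo"
git config user.email "demo@example.com"

# Commit the initial dataset
echo "Date,Total guests" > Tortillas.csv
git add Tortillas.csv
git commit -m "Initial three-month dataset"

# Later: commit cleaned/augmented versions so every step stays reproducible
echo "2024-03-17,50" >> Tortillas.csv
git commit -am "Add new guest counts"
git log --oneline
```

Each commit captures a dataset version, so any earlier state can be restored and every preprocessing step can be traced.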
Data Understanding¶
🤔 Find out what is going on¶
Let's first examine each dataset one by one (to get a better understanding); later we will combine them.
Tortillas¶
# Define the order of the days
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Convert 'Day' column to categorical data type with the defined order
dfTortillas['Day'] = pd.Categorical(dfTortillas['Day'], categories=day_order, ordered=True)
# Calculate the average total guests for each day
avg_total_guests_per_day = dfTortillas.groupby('Day', observed=False)['Total guests'].mean()
# Create a line plot
plt.figure(figsize=(10, 6))
avg_total_guests_per_day.plot(marker='o', color='b', linestyle='-')
plt.title('Average Total Guests per Day')
plt.xlabel('Day')
plt.ylabel('Average Total Guests')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
# Create a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Total guests', y='Day', data=dfTortillas, hue='Day', palette='Set3', dodge=False, legend=False)
plt.title('Distribution of Total Guests by Day')
plt.xlabel('Day')
plt.ylabel('Total Guests')
plt.grid(True)
plt.show()
You can observe that the day of the week significantly influences the number of guests. The restaurant opens at 2 p.m. on weekends but at 5 p.m. on weekdays. Additionally, weekends (including Fridays) are particularly busy, as they mark the beginning of the weekend. This makes sense: most people tend to dine out on their free evenings, and we schedule more colleagues during weekends because we have come to anticipate it.
# Define custom color palette for each season
season_palette = {'Winter': 'blue', 'Spring': 'green', 'Summer': 'red', 'Autumn': 'orange'}
# Sort the DataFrame by season
sorted_seasons = ['Winter', 'Spring', 'Summer', 'Autumn']
dfTortillas['Season'] = pd.Categorical(dfTortillas['Season'], categories=sorted_seasons, ordered=True)
# Create a box plot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Season', y='Total guests', data=dfTortillas, palette=season_palette, dodge=False)
plt.title('Distribution of Total Guests by Season')
plt.xlabel('Season')
plt.ylabel('Total Guests')
plt.grid(True)
plt.show()
The restaurant offers three main seating areas: downstairs, upstairs, and the terrace. While the terrace is bustling on warm days, it has fewer spots compared to upstairs. During colder weather, we typically open upstairs, but fewer people venture outside on winter days. So initially, I expected overall patronage to remain similar across seasons. What is interesting, however, is the presence of numerous outliers during autumn. Let's look into why.
# Print rows containing autumn and the outliers only
print(dfTortillas[(dfTortillas['Season'] == 'Autumn') & (dfTortillas['Total guests'] > 100)])
          Date       Day  Season  Total guests
120 2023-11-18  Saturday  Autumn           142
121 2023-11-17    Friday  Autumn           119
127 2023-11-11  Saturday  Autumn           120
141 2023-10-28  Saturday  Autumn           115
148 2023-10-21  Saturday  Autumn           125
161 2023-10-08    Sunday  Autumn           166
484 2022-11-19  Saturday  Autumn           121
All these days fall on weekends, which are typically the busiest on average. The top three days and the last one coincide with Glow, which I'll discuss later in relation to festival days. On October 8th, there's usually a surge in activity due to the marathon event. Perhaps on October 28th, there were two concerts scheduled, and on October 21st, there was a PSV match. I'll delve into these datasets later for further analysis.
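As a quick sketch of how such a cross-reference could be done (toy rows standing in for the real Effenaar data; the structure follows the data dictionary above):

```python
import pandas as pd

# Autumn outlier dates taken from the output above
outlier_dates = pd.to_datetime(['2023-10-28', '2023-10-21', '2023-11-18'])

# Hypothetical events table mirroring the Effenaar structure
events = pd.DataFrame({'Name': ['Concert A', 'Concert B'],
                       'Date': pd.to_datetime(['2023-10-28', '2023-12-01'])})

# Events that coincide with the busy outlier days
matches = events[events['Date'].isin(outlier_dates)]
print(matches)
```

The same `isin` filter can be applied to the PSV and holiday datasets to see which external factor lines up with each spike.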
To gain insight into how the number of guests is distributed per date, I'll create both line and scatter plot visualizations. Given the variability in business from day to day, I anticipate that the data points will be scattered across the plot, reflecting the varying levels of busyness on different dates.
# Convert 'Date' column to numeric representing days since the start of the dataset
dfTortillas['NumericDate'] = (dfTortillas['Date'] - dfTortillas['Date'].min()).dt.days
plt.figure(figsize=(20, 10))
sns.lineplot(x='NumericDate', y='Total guests', data=dfTortillas)
sns.regplot(x='NumericDate', y='Total guests', data=dfTortillas, scatter=False, color='r', ci=None, line_kws={'label':'Trend Line'})
plt.title('Total Guests per Date with Trend Line')
plt.xlabel('Days Since Start of Dataset')
plt.ylabel('Total Guests')
plt.grid(True)
plt.show()
# Drop NumericDate
dfTortillas.drop('NumericDate', axis=1, inplace=True)
The day of the week significantly impacts the flow of visitors, with Saturdays the busiest, closely followed by Fridays. Sundays also tend to be busy, albeit to a lesser extent. Conversely, weekdays see noticeably fewer patrons, marking a quieter period for business. The line plot jumps up and down all over the place, and so does the scatter plot. However, if you look closely, some patterns exist. There seems to be a small cluster right before 01/01/2023; I'll examine that. I've also added a regression line, which shows that the number of guests increases over time. This could be due to COVID-19, which heavily affected the first quarter of the data.
# Print rows before 01/01/2023 and sorted by date
print(dfTortillas[dfTortillas['Date'] < '2023-01-01'].sort_values(by='Date', ascending=False))
          Date        Day  Season  Total guests
442 2022-12-31   Saturday  Winter             0
443 2022-12-30     Friday  Winter             0
444 2022-12-29   Thursday  Winter             0
445 2022-12-28  Wednesday  Winter             0
446 2022-12-27    Tuesday  Winter             0
..         ...        ...     ...           ...
802 2022-01-05  Wednesday  Winter             0
803 2022-01-04    Tuesday  Winter             0
804 2022-01-03     Monday  Winter             0
805 2022-01-02     Sunday  Winter             0
806 2022-01-01   Saturday  Winter             0

[365 rows x 4 columns]
On those days, the restaurant was closed, even though it wasn't its regular closing day. The boss had to manually enter guests into the system, as it wouldn't function otherwise. This raises the question of how to account for closing days in the prediction model, a topic I'll need to discuss further with my stakeholder. A guest count of 0 means the restaurant was closed; keeping these rows would mislead the machine learning model, so excluding them from the model is a valid solution for now.
# Scatter plot
plt.figure(figsize=(20, 10))
sns.scatterplot(x='Date', y='Total guests', data=dfTortillas, hue='Day')
plt.title('Total Guests per Date and day')
plt.xlabel('Date')
plt.ylabel('Total Guests')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
After discussing with my stakeholder, we decided to remove two specific instances when the restaurant was closed. The first closure was due to COVID-19, and the second occurred during a holiday. Predictions during these periods are unnecessary as the restaurant's status was already determined. Furthermore, we observed only one instance when the restaurant had zero customers, which happened way back, during winter when there was heavy snowfall. The likelihood of this scenario recurring is extremely low.
# Remove rows where total guest is 0
dfTortillas = dfTortillas[dfTortillas['Total guests'] != 0]
# Extract month and year
dfTortillas['Month'] = dfTortillas['Date'].dt.month
dfTortillas['Year'] = dfTortillas['Date'].dt.year
Weather¶
I'll create line plots to identify any discernible patterns. As with the guest data, I expect the weather data to exhibit variability, reflecting day-to-day fluctuations in conditions.
# Visualize temperature across the year per date
plt.figure(figsize=(20, 10))
sns.lineplot(x='Date', y='Temperature', data=dfWeather_selected, color='red')
plt.title('Temperature per Date')
plt.xlabel('Date')
plt.ylabel('Temperature')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
# Visualize rain across the year per date
plt.figure(figsize=(20, 10))
sns.lineplot(x='Date', y='Rain', data=dfWeather_selected, color='blue')
plt.title('Rain per Date')
plt.xlabel('Date')
plt.ylabel('Rain')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
# Visualize wind across the year per date
plt.figure(figsize=(20, 10))
sns.lineplot(x='Date', y='Wind', data=dfWeather_selected, color='gray')
plt.title('Wind per Date')
plt.xlabel('Date')
plt.ylabel('Wind')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
You can see there is a clear pattern in the temperature plot, indicating that temperatures are generally warmer during the summer months compared to the winter months. This pattern aligns with common expectations. It's possible that during hotter weather, people may have less appetite or be less inclined to eat, while colder temperatures might encourage people to stay indoors. To explore this further, you could calculate the average temperature and examine any correlation between deviations from this average and guest counts.
# Deviation from the assumed yearly average of 15 degrees
dfWeather_selected['Temperature difference'] = dfWeather_selected['Temperature'] - 15
# Visualize the deviation: blue bars below the average, red bars above
plt.figure(figsize=(20, 10))
ax = sns.barplot(x='Date', y='Temperature difference', data=dfWeather_selected,
palette=[('blue' if x < 0 else 'red') for x in dfWeather_selected['Temperature difference']])
ax.xaxis.set_major_locator(mdates.MonthLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
plt.title('Temperature difference per Date')
plt.xlabel('Date')
plt.ylabel('Temperature difference')
plt.show()
# Apply absolute function to 'Temperature difference' column
dfWeather_selected['Temperature difference'] = dfWeather_selected['Temperature difference'].abs()
# Create a new dataframe with weather and Tortillas combined
dfCombined = dfTortillas.merge(dfWeather_selected, how='inner', on='Date')
dfCombined.head()
dfWeather_selected = dfWeather_selected.drop(columns=['Temperature difference'])
With the weather data merged into the Tortillas dataframe, I'll look for patterns.
# Make scatter plot for temperature and total guests
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Temperature', y='Total guests', data=dfCombined, hue='Season')
plt.title('Total Guests vs Temperature')
plt.xlabel('Temperature')
plt.ylabel('Total Guests')
plt.grid(True)
plt.show()
There isn't a clear correlation or pattern between temperature and total guests. I'll try analyzing it via the deviation from the average temperature instead.
# Visualize difference temperature and total guest lineplot
plt.figure(figsize=(10, 6))
sns.lineplot(x='Temperature difference', y='Total guests', data=dfCombined)
plt.title('Total Guests vs Temperature difference')
plt.xlabel('Temperature difference')
plt.ylabel('Total Guests')
plt.grid(True)
plt.show()
# Drop temperature difference
dfCombined = dfCombined.drop(columns=['Temperature difference'])
The same issue persists here: no clear pattern emerges. I experimented with the absolute deviation from 15 degrees (the approximate average), but the data shows no consistent trend, regardless of whether the deviation lies above or below that average.
# Visualize total guests and rain
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Rain', y='Total guests', data=dfCombined, hue='Season')
plt.title('Total Guests vs Rain')
plt.xlabel('Rain')
plt.ylabel('Total Guests')
plt.grid(True)
plt.show()
You can see a weak negative correlation between rain and total guests. That makes sense: most people prefer not to go out while it's raining.
# Make histogram of timezone max rain
plt.figure(figsize=(10, 6))
sns.histplot(dfWeather_selected['Timezone max rain'], bins=20, kde=True)
plt.title('Timezone max rain')
plt.xlabel('Timezone max rain')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
The problem with this histogram is that when there is no rain, 'Timezone max rain' (the hour of maximum rainfall) defaults to 1. To fix that, make a copy of the dataframe and remove the rows with zero rain.
# Make copy of weather df
dfWeather_selected_copy = dfWeather_selected.copy()
# Remove rows with Max rain =< 0
dfWeather_selected_copy = dfWeather_selected_copy[dfWeather_selected_copy['Max rain'] > 0]
# Make histogram of timezone max rain
plt.figure(figsize=(10, 6))
sns.histplot(dfWeather_selected_copy['Timezone max rain'], bins=24, kde=True)
plt.title('Timezone max rain')
plt.xlabel('Timezone max rain')
plt.ylabel('Frequency')
plt.grid(axis='y')
plt.show()
There's a noticeable uptick in the curve from 12:00 to 21:00, which is particularly relevant because the restaurant opens at 17:00 on weekdays and 14:00 on weekends. It stands to reason that rain peaking in the afternoon or evening would keep people from going out to eat. Hence, I'm interested in dividing the data into three groups to look for potential patterns.
# Divide max rain into 3 groups: no rain, 12:00-21:00, 21:00-12:00
dfCombined.loc[dfCombined['Max rain'] <= 0, 'Max rain group'] = 'No rain'
dfCombined.loc[(dfCombined['Max rain'] > 0) & (dfCombined['Timezone max rain'] >= 12) & (dfCombined['Timezone max rain'] <= 21), 'Max rain group'] = '12:00 - 21:00'
dfCombined.loc[(dfCombined['Max rain'] > 0) & ((dfCombined['Timezone max rain'] > 21) | (dfCombined['Timezone max rain'] < 12)), 'Max rain group'] = '21:00 - 12:00'
# Sort them
dfCombined['Max rain group'] = pd.Categorical(dfCombined['Max rain group'], categories=['No rain', '12:00 - 21:00', '21:00 - 12:00'], ordered=True)
# Visualize Max rain group and total guests
plt.figure(figsize=(10, 6))
sns.boxplot(x='Total guests', y='Max rain group', data=dfCombined, hue='Max rain group', palette='Set2', dodge=False, legend=False)
plt.title('Distribution of Total Guests by Max Rain Group')
plt.xlabel('Total Guests')
plt.ylabel('Max Rain Group')
plt.grid(True)
plt.show()
My initial hypothesis was that if the peak of rainfall falls between 12:00 and 21:00, there would be fewer customers (as fewer people would be out in town, coinciding with the restaurant's opening hours), while the "no rain" condition would see the highest footfall, and the period from 21:00 to 12:00 would lie somewhere in between. Although the average for 12:00 to 21:00 is slightly lower, it's nearly the same across the board, suggesting that this initial assumption may not hold true.
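One way to back up that visual impression is to compare the group medians directly; a sketch on synthetic data (random numbers, not the real `dfCombined`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for dfCombined: three rain groups drawn from the
# same guest distribution, mirroring the near-identical boxplots above
toy = pd.DataFrame({
    "Max rain group": rng.choice(
        ["No rain", "12:00 - 21:00", "21:00 - 12:00"], size=300),
    "Total guests": rng.normal(60, 15, size=300).round(),
})

# Per-group medians quantify what the boxplot only shows visually
medians = toy.groupby("Max rain group")["Total guests"].median()
print(medians)
```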
# Print the number of rows for each category
print(dfCombined['Max rain group'].value_counts())
Max rain group
No rain          408
21:00 - 12:00    191
12:00 - 21:00    164
Name: count, dtype: int64
There seems to be little correlation between weather conditions and the number of guests. I'm also unable to investigate other correlations between the weather and other aspects of the restaurant (for example, the type of products sold). Therefore, weather may not be a reliable feature for predicting the number of customers in the machine learning part.
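To quantify "little correlation", each weather column can be correlated against the guest counts in one line with `corrwith`; a sketch on synthetic data where guests are independent of the weather by construction:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
# Synthetic stand-in for the merged frame; guests are generated
# independently of the weather, so all correlations should be near zero
toy = pd.DataFrame({
    "Temperature": rng.normal(15, 8, n),
    "Rain": rng.exponential(2, n),
    "Wind": rng.normal(5, 2, n),
})
toy["Total guests"] = rng.normal(60, 15, n)

# corrwith computes every weather column's correlation with guests at once
corr = toy[["Temperature", "Rain", "Wind"]].corrwith(toy["Total guests"])
print(corr)
```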
dfWeather_selected.head()
| Date | Temperature | Rain | Duration rain | Max rain | Timezone max rain | Wind | Timezone max wind | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-01 | 14.4 | 0.0 | 0.0 | 0.0 | 1 | 4.2 | 1 |
| 1 | 2022-01-02 | 13.7 | 2.7 | 4.8 | 0.6 | 19 | 6.6 | 21 |
| 2 | 2022-01-03 | 11.1 | 0.5 | 0.4 | 0.4 | 7 | 6.8 | 1 |
| 3 | 2022-01-04 | 8.5 | 3.9 | 3.9 | 1.7 | 15 | 3.7 | 12 |
| 4 | 2022-01-05 | 6.2 | 3.3 | 2.8 | 2.0 | 19 | 5.3 | 11 |
PSV¶
First of all, let's check whether the data looks sound, with nothing malformed or out of shape.
# Convert 'Date' column to datetime format if it's not already
dfPsv['Date'] = pd.to_datetime(dfPsv['Date'])
# Make a copy of PSV
dfPsv_copy = dfPsv.copy()
# Extract year and month from the 'Date' column
dfPsv_copy['Year'] = dfPsv_copy['Date'].dt.year
dfPsv_copy['Month'] = dfPsv_copy['Date'].dt.month
# Select only the year 2022 and 2023
dfPsv_copy = dfPsv_copy[dfPsv_copy['Year'].isin([2022, 2023])]
# Group by year and month and count the number of games
games_per_year_month = dfPsv_copy.groupby(['Year', 'Month']).size().unstack(fill_value=0)
# Fill missing months with zeros
months = range(1, 13)
games_per_year_month = games_per_year_month.reindex(columns=months, fill_value=0)
# Define the width of each bar
bar_width = 0.35
# Define the x-coordinates for each group of bars
x = np.arange(len(games_per_year_month.columns))
# Plot the bar plot
plt.figure(figsize=(12, 6))
# Plot bars for each month for 2022 (gray color)
plt.bar(x - bar_width/2, games_per_year_month.loc[2022], width=bar_width, color='gray', label='2022')
# Plot bars for each month for 2023 (black color)
plt.bar(x + bar_width/2, games_per_year_month.loc[2023], width=bar_width, color='black', label='2023')
plt.title('Number of Games Played per Month')
plt.xlabel('Month')
plt.ylabel('Number of Games')
plt.xticks(ticks=x, labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.yticks(np.arange(0, games_per_year_month.max().max() + 1, step=1))
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
Overall it looks normal; nothing unusual is going on. June falls within the off-season, when players rest, so PSV generally plays no official matches then. Now we want to check whether there is any correlation between PSV matches and the number of guests at Tortillas.
# Merge dfTortillas with dfPsv
dfCombined = dfTortillas.merge(dfPsv, how='left', left_on='Date', right_on='Date')
# Make extra column with 'Game', containing true or false, based on Type being null or not
dfCombined['Game'] = ~dfCombined['Type'].isnull()
colors = {True: 'green', False: 'red'}
plt.figure(figsize=(20, 6))
sns.boxplot(data=dfCombined, x='Day', y='Total guests', hue='Game', palette=colors)
plt.xlabel('Day')
plt.ylabel('Amount of guests')
plt.title('Amount of guests, grouped by day, sorted by game')
plt.grid(axis='y', alpha=0.5)
plt.show()
You can see that, overall, a match day attracts slightly more guests, so this feature should definitely be useful for the model. There is one outlier, which I'll examine.
# Filter day is Tuesday, Game is true and total guests < 25
outlier = dfCombined[(dfCombined['Day']=='Tuesday') & (dfCombined['Game']==True) & (dfCombined['Total guests'] < 25)]
outlier
| Date | Day | Season | Total guests | Month | Year | Time | Type | Cancelled | Game | |
|---|---|---|---|---|---|---|---|---|---|---|
| 751 | 2022-02-08 | Tuesday | Winter | 6 | 2 | 2022 | 19:00 | KNVB Beker | 0.0 | True |
This was during COVID-19, which likely explains the poor turnout. The best course is to remove the outlier to reduce noise.
# Remove PSV game on the date 2022-02-08
dfPsv = dfPsv[dfPsv['Date'] != '2022-02-08']
Now we can check whether kickoff time has any influence. First, let's look at how it's distributed.
# Convert 'Time' column to datetime format
dfCombined['Time'] = pd.to_datetime(dfCombined['Time'])
# Extract hour component from 'Time' column
dfCombined['Hour'] = dfCombined['Time'].dt.hour
# Create the histogram
plt.figure(figsize=(10, 6))
sns.histplot(data=dfCombined, x='Hour', kde=True, color='black')
plt.title('Histogram of Time')
plt.xlabel('Time (Hour)')
plt.ylabel('Frequency')
plt.xlim(12, 24)
plt.show()
We can split the kickoff times into two groups around 17:00. Then we can group by day again and check whether kickoff time has any influence.
# Divide the hours into 2 groups: Before 17:00 and after 17:00
dfCombined.loc[dfCombined['Hour'] <= 17, 'Time group'] = 'Before 17:00'
dfCombined.loc[dfCombined['Hour'] > 17, 'Time group'] = 'After 17:00'
# Sort them
dfCombined['Time group'] = pd.Categorical(dfCombined['Time group'], categories=['Before 17:00', 'After 17:00'], ordered=True)
colors = {"Before 17:00": "skyblue", "After 17:00": "salmon"}
# Visualize Time group and total guests
plt.figure(figsize=(10, 6))
sns.boxplot(x='Total guests', y='Time group', data=dfCombined, hue='Time group', dodge=False, palette=colors)
plt.title('Distribution of Total Guests by Time Group')
plt.xlabel('Total Guests')
plt.ylabel('Time Group')
plt.grid(True)
plt.show()
So overall, kickoff time doesn't have a big influence; however, the after-17:00 group has two outliers.
print(dfCombined[(dfCombined['Time group'] == 'After 17:00') & (dfCombined['Total guests'] > 120)].to_string())
| Date | Day | Season | Total guests | Month | Year | Time | Type | Cancelled | Game | Hour | Time group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64 | 2024-01-13 | Saturday | Winter | 123 | 1 | 2024 | 2024-05-17 21:00:00 | Eredivisie | 0.0 | True | 21.0 | After 17:00 |
| 147 | 2023-10-21 | Saturday | Autumn | 125 | 10 | 2023 | 2024-05-17 18:45:00 | Eredivisie | 0.0 | True | 18.0 | After 17:00 |
There's no obvious reason for the peaks on those days. Both were Saturdays, and on 13 January 2024 there were also two concerts, so you could say those were simply "good" days for the restaurant. They can stay in the dataset. Next, we can look at the match type: it's possible to categorize it into "Eredivisie" and "Other" and check whether the type has a big influence.
# Create bar plot with stacked bars
plt.figure(figsize=(12, 8))
sns.countplot(data=dfCombined, x='Day', hue='Type')
plt.title('Frequency of Match Types on Each Day')
plt.xlabel('Day')
plt.ylabel('Frequency/Count')
plt.xticks(rotation=0)
plt.legend(title='Type')
plt.grid(axis='y', alpha=0.5)
plt.show()
Many games are played on weekends, when we typically see more customers. Looking at the type, Eredivisie matches are the most frequent and mostly take place on weekends, while the other match types are held on weekdays. It's important to keep this in mind when analyzing the data. Let's plot the guest distributions, grouped by day, to see whether match days attract more customers.
# Replace null values with "None" in the "Type" column
dfCombined['Type'] = dfCombined['Type'].fillna('None')
dfCombined.loc[(dfCombined['Type'] != 'Eredivisie') & (dfCombined['Type'] != 'None'), 'Type'] = 'Other'
# Create a FacetGrid
g = sns.FacetGrid(dfCombined, col='Day', hue='Type', col_wrap=3, height=4)
# Map the KDE plot onto the FacetGrid
g.map(sns.kdeplot, 'Total guests', fill=True)
g.set_titles("{col_name}")
g.set_xlabels("Total Guests")
g.set_ylabels("Density")
g.add_legend()
g.set(xlim=(0, None))
plt.show()
The number of guests increases when a match is on; however, the match type largely depends on the day, and the day itself has a big influence. It's hard to tell whether the type itself matters: there is too little data, and the distributions look mostly the same, so it probably doesn't.
Concerts¶
First of all, let's check whether the data looks sound, with nothing malformed or out of shape.
# Make a copy of Effenaar
dfEffenaar_copy = dfEffenaar.copy()
# Extract year and month from the 'Date' column
dfEffenaar_copy['Year'] = dfEffenaar_copy['Date'].dt.year
dfEffenaar_copy['Month'] = dfEffenaar_copy['Date'].dt.month
# Select only the year 2022 and 2023
dfEffenaar_copy = dfEffenaar_copy[dfEffenaar_copy['Year'].isin([2022, 2023])]
# Group by year and month and count the number of concerts
concerts_per_year = dfEffenaar_copy.groupby(['Year', 'Month']).size().unstack(fill_value=0)
bar_width = 0.35
# Define the x-coordinates for each group of bars
x = np.arange(len(concerts_per_year.columns))
# Plot the bar plot
plt.figure(figsize=(12, 6))
plt.bar(x - bar_width/2, concerts_per_year.loc[2022], width=bar_width, color='gray', label='2022')
plt.bar(x + bar_width/2, concerts_per_year.loc[2023], width=bar_width, color='black', label='2023')
plt.title('Concerts played per Month')
plt.xlabel('Month')
plt.ylabel('Number of Concerts')
plt.xticks(ticks=x, labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.yticks(np.arange(0, 30, step=2))
plt.legend()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
There is a flaw in this dataset: it only contains data from 01-09-2022 onwards. There were definitely concerts before that date; they're simply not posted on the site. This is considered "missing not at random" (MNAR): the missingness is related to the unobserved values themselves, rather than being completely random or related only to the observed data. Handling this is difficult. I first tried to contact the Effenaar, but sadly they didn't respond, so that option is off the table.
This missing data presents several challenges. Ignoring it could lead to inaccurate predictions, since the data spans over two years, but filling in the missing values is nearly impossible because the exact dates of concerts before September 2022 are unknown.
One option is to remove the rows with missing data, but that would discard around 25% of the dataset. While drastic, this would simplify the analysis and ensure that the remaining data is reliable. Considering that the discarded period includes the impact of the COVID-19 pandemic, it may not significantly hurt the model's overall predictive power.
Another approach is to train two separate models: one on data before September 2022 and another on data from September 2022 onwards, capturing the different patterns in each period. However, since the goal is to predict future guest numbers, focusing solely on data from September 2022 onwards might be the most pragmatic approach.
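Both options boil down to a single date filter; a sketch with hypothetical rows standing in for the real combined frame:

```python
import pandas as pd

# Hypothetical rows; the real dfCombined has the same 'Date' column
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-03-01", "2022-08-31", "2022-09-01", "2023-01-15"]),
    "Total guests": [40, 55, 62, 70],
})

cutoff = pd.Timestamp("2022-09-01")
df_after = df[df["Date"] >= cutoff]   # rows where concert data exists
df_before = df[df["Date"] < cutoff]   # rows that would feed a separate model

print(len(df_before), len(df_after))  # 2 2
```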
Before deciding, let's first look at the correlation between Effenaar concerts and the number of guests.
# Merge Tortillas with Effenaar
dfCombined = dfTortillas.merge(dfEffenaar, how='left', left_on='Date', right_on='Date')
# Barplot per day, value of how many concerts there are, hue by location
plt.figure(figsize=(12, 6))
sns.countplot(data=dfCombined, x='Day', hue='Location', palette='dark:gray')
plt.title('Concerts per Day')
plt.xlabel('Day')
plt.ylabel('Amount of concerts')
plt.xticks(rotation=0)
plt.legend(title='Location')
plt.grid(axis='y', alpha=0.5)
plt.show()
It's logical that concerts are mostly held on weekends, when people are free and have more time. Let's look at the sold-out aspect: you'd expect sold-out concerts to bring more guests. Likewise, a bigger hall fits more people, so you'd also expect concerts in the larger hall to bring more people to the restaurant.
# Create a figure with two subplots
plt.figure(figsize=(18, 8))
# Define colors for True and False
colors = {1.0: 'green', 0: 'red'}
# Subplot 1: Combined box plot
plt.subplot(1, 2, 1)
sns.boxplot(data=dfCombined, x='Location', y='Total guests', hue='Sold out', palette=colors)
plt.title('Combined Boxplot of Total Guests by Location and Sold Out Status')
plt.xlabel('Location')
plt.ylabel('Total Guests')
legend_handles = [mpatches.Patch(color=color, label=label) for label, color in colors.items()]
plt.legend(handles=legend_handles, title='Sold Out', labels=['True', 'False'])
plt.ylim(0, 180)
# Subplot 2: Normal box plot
plt.subplot(1, 2, 2)
sns.boxplot(data=dfCombined, y='Total guests', color='gray')
plt.title('Boxplot of Total Guests')
plt.ylabel('Total Guests')
plt.ylim(0, 180)
# Adjust layout to avoid overlapping
plt.tight_layout()
# Show the plot
plt.show()
# Print row , location = grote zaal, guests > 135
print(dfCombined[(dfCombined['Location'] == 'Grote zaal') & (dfCombined['Total guests'] > 130)].to_string())
Date Day Season Total guests Month Year Name Time Location Sold out 140 2023-11-18 Saturday Autumn 142 11 2023 We All Love 80's 90's 00's 10's 20:00:00 Grote zaal 0.0
That day was Glow, which is a big influence as well. Overall, whether the concert is in the Grote zaal or the Kleine zaal makes little difference; the distributions are very similar.
So what about the hypotheses? Hall size turns out to make little difference, and the sold-out effect seems reversed: apparently, fewer people come when a concert is sold out. Why?
# Convert the 'Sold out' column to numeric
dfCombined['Sold out'] = pd.to_numeric(dfCombined['Sold out'], errors='coerce')
# Filter out any non-numeric values in 'Sold out'
dfFiltered = dfCombined.dropna(subset=['Sold out'])
# Group by 'Day' and 'Sold out', then count occurrences
pivot_table = dfFiltered.groupby(['Day', 'Sold out']).size().unstack(fill_value=0)
colors = {True: 'green', False: 'red'}
# Plotting the data (let pandas create the figure so no empty figure is left behind)
pivot_table.plot(kind='line', marker='o', figsize=(10, 6), color=[colors[val] for val in pivot_table.columns])
plt.xlabel('Day')
plt.ylabel('Number of Observations')
plt.title('Number of Observations Sold Out per Day')
plt.xticks(rotation=45)
plt.legend(title='Sold out', labels=['False', 'True'])
plt.grid(True)
plt.tight_layout()
plt.show()
The sold-out and not-sold-out counts are closer together on weekdays and further apart on weekends. Sold-out concerts therefore make up a relatively larger share of the weekday observations, and since fewer people visit on weekdays, this drags the pooled sold-out average down. However, if you look at it per day, there is a good chance that "sold out" actually has a positive impact.
# Replace null values with "None" in the "Sold out" column
dfCombined['Sold out'] = dfCombined['Sold out'].fillna('None')
# Replace 1 with 'True' and 0 with 'False' in the 'Sold out' column
dfCombined['Sold out'] = dfCombined['Sold out'].replace({1: 'True', 0: 'False'})
# Create a FacetGrid
g = sns.FacetGrid(dfCombined, col='Day', hue='Sold out', col_wrap=3, height=4)
# Map the KDE plot onto the FacetGrid
g.map(sns.kdeplot, 'Total guests', fill=True)
g.set_titles("{col_name}")
g.set_xlabels("Total Guests")
g.set_ylabels("Density")
g.add_legend()
g.set(xlim=(0, None))
plt.show()
This situation illustrates Simpson's paradox. On a per-location basis there's a clear negative correlation between guest numbers and sold-out status, yet when examined per day, there appears to be no significant correlation.
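A toy numerical illustration of the paradox (synthetic counts, not the real concert data): within each day type, sold-out evenings have slightly fewer guests, yet pooled over all rows sold-out looks better, because sold-out shows cluster on busy weekends.

```python
import pandas as pd

# Synthetic rows: sold-out shows are rare on quiet weekdays and common
# on busy weekends, which flips the pooled comparison
rows = (
    [("Weekday", 0, 30)] * 8 + [("Weekday", 1, 28)] * 2 +
    [("Weekend", 0, 70)] * 2 + [("Weekend", 1, 68)] * 8
)
toy = pd.DataFrame(rows, columns=["DayType", "Sold out", "Total guests"])

pooled = toy.groupby("Sold out")["Total guests"].mean()
per_day = toy.groupby(["DayType", "Sold out"])["Total guests"].mean()

print(pooled.loc[1] > pooled.loc[0])                          # sold-out looks better overall
print(per_day.loc[("Weekday", 1)] < per_day.loc[("Weekday", 0)])  # but worse within weekdays
print(per_day.loc[("Weekend", 1)] < per_day.loc[("Weekend", 0)])  # and worse within weekends
```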
You can merge the Effenaar and Tortillas dataframes in two ways:
- Make a bool per date to check if a concert is playing.
- Make a count of how many concerts there are playing.
# Make a new column called "concert", if name is not null set true, else false
dfCombined['Concert'] = ~dfCombined['Name'].isnull()
# Print correlation between total guests and concert
print('Boolean: ')
print(dfCombined['Total guests'].corr(dfCombined['Concert']))
# Remove rows before September 2022
dfCombined = dfCombined[dfCombined['Date'] >= '2022-09-01']
print('Boolean (after September 2022): ')
print(dfCombined['Total guests'].corr(dfCombined['Concert']))
dfEffenaar_count = dfEffenaar.groupby('Date').size().reset_index(name='Effenaar_Count')
dfCombined = pd.merge(dfTortillas, dfEffenaar_count, on='Date', how='left')
# Fill NaN values with 0
dfCombined['Effenaar_Count'] = dfCombined['Effenaar_Count'].fillna(0)
print('Count: ')
print(dfCombined['Total guests'].corr(dfCombined['Effenaar_Count']))
# Remove rows before September 2022
dfCombined = dfCombined[dfCombined['Date'] >= '2022-09-01']
print('Count (after September 2022): ')
print(dfCombined['Total guests'].corr(dfCombined['Effenaar_Count']))
Boolean:
0.37623058717913904
Boolean (after September 2022):
0.3381984131480312
Count:
0.353880393686363
Count (after September 2022):
0.3377539185018148
Either way, the correlation decreases after restricting to observations from September 2022 onwards. This makes sense: the earlier graph of Tortillas guests over time shows guest numbers increasing, and the missing Effenaar data at the start could inflate the correlation beyond its true value. Also, the count and boolean versions barely differ, and for the user interface it's simpler to offer a checkbox for "concert today" than to have the user enter the number of concerts every day.
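Deriving the checkbox-friendly boolean from the count is a one-liner; a sketch on a hypothetical slice of the frame:

```python
import pandas as pd

# Hypothetical slice; Effenaar_Count is the per-date concert count as above
df = pd.DataFrame({"Effenaar_Count": [0.0, 1.0, 2.0, 0.0]})

# A single checkbox in the UI maps to this boolean feature
df["Concert"] = df["Effenaar_Count"] > 0
print(df["Concert"].tolist())  # [False, True, True, False]
```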
Data Preparation¶
Tortillas (side exercise)¶
In version 1.2 of the data there are some missing guest counts in the Tortillas dataframe.
dfTortillas.isna().sum()
Date 0 Day 0 Season 0 Total guests 0 Month 0 Year 0 dtype: int64
This was my oversight. We use a system at work that displays daily guest counts, and I had to photograph it with my phone to enter the data into an Excel sheet. When I reviewed the pictures at home, some values were blurry and unreadable, so I couldn't enter them accurately. This is considered Missing at Random (MAR): an incidental artifact of the image-capture process, rather than something related to the guest counts themselves.
# Calculate the percentage of missing values for the total guests
missing_total_guests = dfTortillas['Total guests'].isna().sum()
total_rows = dfTortillas.shape[0]
percentage_missing_total_guests = (missing_total_guests / total_rows) * 100
print(f'The percentage of missing values for the total guests is {percentage_missing_total_guests:.2f}%')
The percentage of missing values for the total guests is 0.00%
With nearly 20% of the data missing, that's quite a substantial portion. Let's create a histogram to visualize how these missing values are distributed throughout the dataset.
# make histogram
plt.figure(figsize=(10, 6))
sns.histplot(dfTortillas['Total guests'], bins=30, kde=True)
plt.title('Distribution of Total Guests')
plt.xlabel('Total Guests')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
Filling in the missing values according to the day type is crucial. As seen in the initial data, weekends have more complete entries compared to weekdays. Therefore, we need to prioritize filling in the missing values based on whether it's a weekend or a weekday to ensure the dataset's integrity.
The steps:
- Fill values by day, prioritizing this feature for accuracy.
- Maintain the shape of the histogram; filling in missing values with averages would distort the histogram.
# Group the DataFrame by 'Day' (group_keys=False keeps the original index)
grouped = dfTortillas.groupby('Day', group_keys=False)
# Define a function to fill missing values by interpolation within each group
def fill_missing(group):
    # Interpolate missing values and round to whole guests
    group['Total guests'] = group['Total guests'].interpolate().round()
    return group
df_filled = grouped.apply(fill_missing)
df_filled = df_filled.sort_values(by='Date').reset_index(drop=True)
# Visualize total guests per day
plt.figure(figsize=(10, 6))
sns.boxplot(x='Total guests', y='Day', data=df_filled, hue='Day', palette='Set3', dodge=False, legend=False)
plt.title('Distribution of Total Guests by Day')
plt.xlabel('Total Guests')
plt.ylabel('Day')
plt.grid(True)
plt.show()
# Create histogram Tortillas
plt.figure(figsize=(10, 6))
sns.histplot(df_filled['Total guests'], bins=30, kde=True)
plt.title('Distribution of Total Guests')
plt.xlabel('Total Guests')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
df_filled.isna().sum()
Date 0 Day 0 Season 0 Total guests 0 Month 0 Year 0 dtype: int64
dfTortillas = df_filled
As you can see, filling the values didn't noticeably affect the histogram or the boxplot. Now let's check the timeline.
# Make a line plot of how many guests there were on each date
plt.figure(figsize=(20, 10))
sns.lineplot(x='Date', y='Total guests', data=df_filled)
plt.title('Total Guests per Date')
plt.xlabel('Date')
plt.ylabel('Total Guests')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()
Interpolating generates clear patterns that wouldn't necessarily occur in real-life scenarios. However, it can still be utilized for the machine learning aspect as it provides reasonably good data.
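To make the interpolation behaviour concrete, here's a sketch on a toy series (hypothetical numbers) using the same interpolate-then-round idea as above:

```python
import numpy as np
import pandas as pd

# Toy series for one weekday across five weeks; NaN marks blurry photos
s = pd.Series([40.0, np.nan, 48.0, np.nan, 60.0])

# Linear interpolation draws a straight line between known neighbours,
# which is exactly why the filled timeline looks smoother than reality
filled = s.interpolate().round()
print(filled.tolist())  # [40.0, 44.0, 48.0, 54.0, 60.0]
```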
Final df¶
To understand the functionality of the data, we initially need to consolidate it by merging relevant datasets. Once combined, we can explore potential correlations within the data. To facilitate this exploration, I've introduced a new column indicating the count of PSV matches and Effenaar concerts held on each respective day. This approach allows us to analyze any potential relationships or patterns between the occurrence of these events and other variables within the dataset.
# Convert 'Date' column to datetime format for other DataFrames
dfTortillas['Date'] = pd.to_datetime(dfTortillas['Date'])
dfPsv['Date'] = pd.to_datetime(dfPsv['Date'])
dfEffenaar['Date'] = pd.to_datetime(dfEffenaar['Date'])
# Count the number of events for each date in dfPsv and dfEffenaar
dfPsv_count = dfPsv.groupby('Date').size().reset_index(name='PSV_Count')
dfEffenaar_count = dfEffenaar.groupby('Date').size().reset_index(name='Effenaar_Count')
# Merge the counts with dfTortillas
combined_data = pd.merge(dfTortillas, dfPsv_count, on='Date', how='left')
df = pd.merge(combined_data, dfEffenaar_count, on='Date', how='left')
# Merge dfWeather_selected with finalDf
df = pd.merge(df, dfWeather_selected, on='Date', how='left')
# Merge df with holidays
df = pd.merge(df, dfHolidays, on='Date', how='left')
# Merge df with feastdays
df = pd.merge(df, dfFeastDays, on='Date', how='left')
df.head()
| Date | Day | Season | Total guests | Month | Year | PSV_Count | Effenaar_Count | Temperature | Rain | Duration rain | Max rain | Timezone max rain | Wind | Timezone max wind | Holiday | Feastday | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-26 | Wednesday | Winter | 28 | 1 | 2022 | NaN | NaN | 3.9 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 23.0 | NaN | NaN |
| 1 | 2022-01-27 | Thursday | Winter | 31 | 1 | 2022 | NaN | NaN | 8.9 | 1.2 | 3.6 | 0.4 | 9.0 | 5.7 | 7.0 | NaN | NaN |
| 2 | 2022-01-28 | Friday | Winter | 60 | 1 | 2022 | NaN | NaN | 9.1 | -0.1 | 0.0 | -0.1 | 1.0 | 3.9 | 23.0 | NaN | NaN |
| 3 | 2022-01-29 | Saturday | Winter | 75 | 1 | 2022 | NaN | NaN | 12.2 | -0.1 | 0.0 | -0.1 | 1.0 | 7.1 | 16.0 | NaN | NaN |
| 4 | 2022-01-30 | Sunday | Winter | 39 | 1 | 2022 | NaN | NaN | 8.8 | -0.1 | 0.0 | -0.1 | 1.0 | 3.4 | 24.0 | NaN | NaN |
It contains Null values, so we have to deal with that first.
# Fill NaN with 0 for PSV_Count
df['PSV_Count'] = df['PSV_Count'].fillna(0)
# Fill NaN with 0 for Effenaar_Count
df['Effenaar_Count'] = df['Effenaar_Count'].fillna(0)
# Fill missing FeastDays and Holidays with a placeholder value
df['Feastday'] = df['Feastday'].fillna('None')
df['Holiday'] = df['Holiday'].fillna('None')
# Remove rows containing null values for temperature
df = df.dropna(subset=['Temperature'])
# Print NaN values
print(df.isna().sum())
Date 0 Day 0 Season 0 Total guests 0 Month 0 Year 0 PSV_Count 0 Effenaar_Count 0 Temperature 0 Rain 0 Duration rain 0 Max rain 0 Timezone max rain 0 Wind 0 Timezone max wind 0 Holiday 0 Feastday 0 dtype: int64
Phase 3 Predictions¶
🛠️ Preprocessing (Iteration 1)¶
💡 Feature selection¶
To predict the number of restaurant guests per day, it's helpful to create a correlation matrix. It shows which features correlate with the target and are therefore likely to help the model.
numeric_df = df.drop(columns=['Date', 'Day', 'Season', 'Feastday', 'Holiday'])
# Create the correlation matrix
correlation_matrix = numeric_df.corr()
# Generate a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Map')
plt.show()
# Perform one-hot encoding for categorical variables
df_encoded = pd.get_dummies(df, columns=['Day', 'Season', 'Feastday', 'Holiday'])
# Drop the non-encoded columns so only the target and the dummies remain
df_encoded = df_encoded.drop(columns=['Date', 'PSV_Count', 'Effenaar_Count', 'Temperature', 'Rain', 'Duration rain', 'Max rain', 'Timezone max rain', 'Wind', 'Timezone max wind'])
# Create the correlation matrix
correlation_matrix_encoded = df_encoded.corr()
# Select correlations related to 'Total guests'
total_guests_correlation_encoded = correlation_matrix_encoded['Total guests'].drop('Total guests')
# Plot the correlation values related to 'Total guests'
plt.figure(figsize=(20, 10))
colors = ['black', 'gray']
total_guests_correlation_encoded.plot(kind='barh', color=colors * (len(total_guests_correlation_encoded) // 2 + 1))
plt.title('Correlation with Total Guests')
plt.ylabel('Features')
plt.xlabel('Correlation')
plt.grid(axis='x')
plt.show()
Now that the dataset has been expanded with more features, we can start by selecting all of them and see how the model performs.
# Define features and target variables
features = ['Day', 'PSV_Count', 'Effenaar_Count', 'Temperature', 'Rain', 'Duration rain', 'Max rain', 'Timezone max rain', 'Wind', 'Timezone max wind', 'Holiday', 'Feastday']
target = "Total guests"
X = df[features]
y = df[target]
🆔 Encoding¶
The features Day, Holiday, and Feastday are currently categorical. To include them, they need to be encoded, because they are string types and the model only works with numbers. This isn't entirely straightforward: the categories have no inherent order or ranking, so label encoding (assigning each category a number) would be misleading. One-hot encoding instead creates a column for each category and sets it to 0 or 1 (false or true). This produces many more columns, but the information stays the same.
X = pd.get_dummies(X, columns=['Day', 'Holiday', 'Feastday'])
print(X.columns)
X.head()
Index(['PSV_Count', 'Effenaar_Count', 'Temperature', 'Rain', 'Duration rain',
'Max rain', 'Timezone max rain', 'Wind', 'Timezone max wind',
'Day_Monday', 'Day_Tuesday', 'Day_Wednesday', 'Day_Thursday',
'Day_Friday', 'Day_Saturday', 'Day_Sunday', 'Holiday_Herfstvakantie',
'Holiday_Kerstvakantie', 'Holiday_Meivakantie', 'Holiday_None',
'Holiday_Voorjaarsvakantie', 'Holiday_Zomervakantie',
'Feastday_Bevrijdingsdag', 'Feastday_Carnval',
'Feastday_Dutch Design Week', 'Feastday_Glow', 'Feastday_Guus Meeuwis',
'Feastday_Hemelvaart', 'Feastday_Koningsdag', 'Feastday_Marathon',
'Feastday_Moederdag', 'Feastday_None', 'Feastday_Pasen',
'Feastday_Pinksteren', 'Feastday_Sinterklaas', 'Feastday_Vaderdag',
'Feastday_Valentijnsdag'],
dtype='object')
| PSV_Count | Effenaar_Count | Temperature | Rain | Duration rain | Max rain | Timezone max rain | Wind | Timezone max wind | Day_Monday | ... | Feastday_Hemelvaart | Feastday_Koningsdag | Feastday_Marathon | Feastday_Moederdag | Feastday_None | Feastday_Pasen | Feastday_Pinksteren | Feastday_Sinterklaas | Feastday_Vaderdag | Feastday_Valentijnsdag | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 3.9 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 23.0 | False | ... | False | False | False | False | True | False | False | False | False | False |
| 1 | 0.0 | 0.0 | 8.9 | 1.2 | 3.6 | 0.4 | 9.0 | 5.7 | 7.0 | False | ... | False | False | False | False | True | False | False | False | False | False |
| 2 | 0.0 | 0.0 | 9.1 | -0.1 | 0.0 | -0.1 | 1.0 | 3.9 | 23.0 | False | ... | False | False | False | False | True | False | False | False | False | False |
| 3 | 0.0 | 0.0 | 12.2 | -0.1 | 0.0 | -0.1 | 1.0 | 7.1 | 16.0 | False | ... | False | False | False | False | True | False | False | False | False | False |
| 4 | 0.0 | 0.0 | 8.8 | -0.1 | 0.0 | -0.1 | 1.0 | 3.4 | 24.0 | False | ... | False | False | False | False | True | False | False | False | False | False |
5 rows × 37 columns
🪓 Splitting into train/test¶
We keep 20% of the data aside for testing; the model learns from the other 80%. After training, we check how well the model predicts the outcomes for the held-out 20%. An 80/20 split is a common rule of thumb for a dataset of this size rather than a mathematically optimal choice. If the predictions match this portion well, the model is doing a good job.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 764 observations, of which 611 are now in the train set, and 153 in the test set.
⚖️ Scaling¶
Scaling is generally needed for nearest-neighbors algorithms, because they compare distances between data points to judge similarity. Those distances are computed on the raw feature values, so a feature measured on a large scale will dominate one measured on a small scale.
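As a quick illustration of why this matters, here is a minimal sketch (with made-up toy values, not taken from the dataset) showing how an unscaled feature dominates the Euclidean distance:

```python
import numpy as np

# Toy records: (temperature in °C, one-hot "Friday" dummy); values are invented
a = np.array([25.0, 1.0])  # warm Friday
b = np.array([5.0, 1.0])   # cold Friday
c = np.array([25.0, 0.0])  # warm Monday

# Unscaled, the temperature axis dwarfs the 0/1 day dummy
print(np.linalg.norm(a - b))  # 20.0 -> a 20°C difference dominates
print(np.linalg.norm(a - c))  # 1.0  -> a different day barely registers
```

After standardization, both features contribute on a comparable scale, so a nearest-neighbors model no longer treats temperature as almost the only thing that matters.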
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
🧬 Modelling¶
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
score = knn.score(X_test, y_test)
print("R²:", score)
R²: 0.4354713485462355
The model scores an R² of about 0.44, which is a modest starting point.
🔬 Evaluation¶
# Make predictions on the test set
y_pred = knn.predict(X_test)
# Create a DataFrame to compare truth, prediction, and error absolute
overview = pd.DataFrame({'Truth': y_test, 'Prediction': y_pred, 'Error': abs(y_test - y_pred)})
overview = overview.reset_index(drop=True)
overview.head()
| Truth | Prediction | Error | |
|---|---|---|---|
| 0 | 57 | 36.2 | 20.8 |
| 1 | 28 | 27.6 | 0.4 |
| 2 | 38 | 35.2 | 2.8 |
| 3 | 35 | 45.0 | 10.0 |
| 4 | 75 | 64.4 | 10.6 |
This table alone doesn't tell us much. A regression plot makes the comparison easier to judge visually.
# Plot the regression line
plot = sns.regplot(y=y_test.values.flatten(), x=y_pred.flatten(), line_kws={"color": "r"})
plot.set_xlabel("Predicted amount")
plot.set_ylabel("True amount")
plot.set_title("")
plot
<Axes: xlabel='Predicted amount', ylabel='True amount'>
Many dots lie well away from the red regression line. There are some real outliers, and the model never predicts fewer than about 15 or more than about 90 guests. The relationship between the features and the target variable is not linear, so it's worth looking for something better.
me = max_error(y_test, y_pred)
me = math.ceil(me)
print("Max Error:", me)
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared Error:", rmse)
Max Error: 52 Root Mean Squared Error: 19
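As a side note, RMSE is the square root of the mean of the squared errors, so it penalizes large misses more than a plain mean absolute error would. A small check using the five rows shown above:

```python
import numpy as np

# The five truth/prediction pairs from the overview table above
truth = np.array([57, 28, 38, 35, 75], dtype=float)
pred = np.array([36.2, 27.6, 35.2, 45.0, 64.4])

errors = truth - pred
mae = np.abs(errors).mean()            # plain average miss
rmse = np.sqrt((errors ** 2).mean())   # weights the 20.8 miss more heavily

print(round(mae, 2), round(rmse, 2))   # 8.92 11.43
```

So when the full test set gives an RMSE of 19, the typical miss is somewhat smaller than 19 guests, with occasional large outliers pulling the metric up.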
🗳️ Conclusion.¶
The model's R² of 0.44 is not very accurate, but for a basic nearest-neighbors model using all available features it is an acceptable starting point. There is clearly room for improvement, so exploring other techniques and models is important. The RMSE of 19 means the predictions are typically off by roughly 19 guests. To put that in context: the outside area seats about 30 people, downstairs holds 42, and upstairs fits 38, but usually only one of upstairs or outside is open at a time, so a full restaurant seats around 60 people. An error of roughly a third of capacity could mean estimating 40 or 80 guests when 60 actually show up. A difference of 20 guests can translate into one extra or one fewer server or kitchen staff member, which matters for both costs and stress.
🛠️ Preprocessing (Iteration 2)¶
This time, I've examined the PSV data and introduced a new dataframe named "Netherlands," focusing on the Dutch international soccer team. Analyzing the PSV data, I concluded that the specific match details don't make much difference, and no two matches are played on the same day. Therefore, I've decided to convert the counts into boolean values. This time I'll try to improve on the previous model with a support vector machine.
# Add boolean column instead
df['PSV'] = df['PSV_Count'] == 1
# Drop column PSV_Count
df = df.drop(columns=['PSV_Count'])
# Convert datetime
dfDutch['Date'] = pd.to_datetime(dfDutch['Date'])
# Merge dfTortillas with dfDutch on 'Date'
df = pd.merge(df, dfDutch[['Date', 'Type']], on='Date', how='left')
# Rename 'Type' column to 'Dutch'
df.rename(columns={'Type': 'Dutch'}, inplace=True)
# Fill NaN values with False
df['Dutch'] = df['Dutch'].fillna(False)
# Set type boolean
df['Dutch'] = df['Dutch'].astype(bool)
df.head()
| Date | Day | Season | Total guests | Month | Year | Effenaar_Count | Temperature | Rain | Duration rain | Max rain | Timezone max rain | Wind | Timezone max wind | Holiday | Feastday | PSV | Dutch | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-01-26 | Wednesday | Winter | 28 | 1 | 2022 | 0.0 | 3.9 | 0.0 | 0.0 | 0.0 | 1.0 | 4.0 | 23.0 | None | None | False | False |
| 1 | 2022-01-27 | Thursday | Winter | 31 | 1 | 2022 | 0.0 | 8.9 | 1.2 | 3.6 | 0.4 | 9.0 | 5.7 | 7.0 | None | None | False | False |
| 2 | 2022-01-28 | Friday | Winter | 60 | 1 | 2022 | 0.0 | 9.1 | -0.1 | 0.0 | -0.1 | 1.0 | 3.9 | 23.0 | None | None | False | False |
| 3 | 2022-01-29 | Saturday | Winter | 75 | 1 | 2022 | 0.0 | 12.2 | -0.1 | 0.0 | -0.1 | 1.0 | 7.1 | 16.0 | None | None | False | False |
| 4 | 2022-01-30 | Sunday | Winter | 39 | 1 | 2022 | 0.0 | 8.8 | -0.1 | 0.0 | -0.1 | 1.0 | 3.4 | 24.0 | None | None | False | False |
💡 Feature selection¶
Choosing the right features is crucial for Support Vector Regression (SVR). I'll select features that have a correlation above 0.1 (see the correlation matrix at the beginning of iteration 1). Since I've introduced the new Dutch column, I can simply calculate its correlation with the target.
# Print correlation between Dutch and guests
correlation = df['Dutch'].corr(df['Total guests'])
correlation
0.038285238374651136
The correlation between "Total guests" and "Dutch" is 0.038, which is too weak to be useful for the model, so I'll leave it out. I've also noticed that guest numbers tend to rise over time (by year), so I'll include Year as a feature.
# Define features and target variables
features = ['Day', 'Effenaar_Count', 'PSV', 'Feastday', 'Year']
target = "Total guests"
X = df[features]
y = df[target]
🆔 Encoding¶
X = pd.get_dummies(X, columns=['Day', 'Feastday', 'Year'])
X.head()
| Effenaar_Count | PSV | Day_Monday | Day_Tuesday | Day_Wednesday | Day_Thursday | Day_Friday | Day_Saturday | Day_Sunday | Feastday_Bevrijdingsdag | ... | Feastday_Moederdag | Feastday_None | Feastday_Pasen | Feastday_Pinksteren | Feastday_Sinterklaas | Feastday_Vaderdag | Feastday_Valentijnsdag | Year_2022 | Year_2023 | Year_2024 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | False | False | False | True | False | False | False | False | False | ... | False | True | False | False | False | False | False | True | False | False |
| 1 | 0.0 | False | False | False | False | True | False | False | False | False | ... | False | True | False | False | False | False | False | True | False | False |
| 2 | 0.0 | False | False | False | False | False | True | False | False | False | ... | False | True | False | False | False | False | False | True | False | False |
| 3 | 0.0 | False | False | False | False | False | False | True | False | False | ... | False | True | False | False | False | False | False | True | False | False |
| 4 | 0.0 | False | False | False | False | False | False | False | True | False | ... | False | True | False | False | False | False | False | True | False | False |
5 rows × 27 columns
🪓 Splitting into train/test¶
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 764 observations, of which 611 are now in the train set, and 153 in the test set.
🧬 Modelling¶
In this step we fit the model on the training set only; in this case a support vector machine for regression (SVR).
model = SVR()
model.fit(X_train, y_train)
# Print score
score = model.score(X_test, y_test)
print("R²:", score)
R²: 0.34643019118356055
The model's score fluctuates between 0.30 and 0.40 across runs.
🔬 Evaluation¶
y_pred = model.predict(X_test)
me = max_error(y_test, y_pred)
me = math.ceil(me)
print("Max Error:", me)
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared Error:", rmse)
Max Error: 69 Root Mean Squared Error: 21
The RMSE is 21 guests, with a maximum error of 69.
🗳️ Conclusion.¶
The model performs worse than kNN, but that's without any tuning (the RMSE is 21). It currently uses the default 'rbf' kernel, but given the data's relative simplicity, a more linear approach might fit better. For instance, if PSV plays, that should plausibly add guests in a roughly linear fashion. Tuning the model could improve its performance, although I don't expect a dramatic boost in the score. Still, it's worth trying. In the next iteration, I'll tune the model and check for overfitting.
🛠️ Preprocessing (Iteration 3)¶
This time I've examined the Effenaar data; you can check the analysis here. First, I'm going to change "Effenaar_Count" into a boolean, because the change doesn't significantly affect the correlation, and it's more comfortable for the user to simply tick whether there is a concert or not instead of looking up how many there are.
# Add boolean column instead
df['Effenaar'] = df['Effenaar_Count'] == 1
# Drop column Effenaar_Count
df = df.drop(columns=['Effenaar_Count'])
# only use dates after September 2022
df = df[df['Date'] >= '2022-09-01']
df.head()
| Date | Day | Season | Total guests | Month | Year | Temperature | Rain | Duration rain | Max rain | Timezone max rain | Wind | Timezone max wind | Holiday | Feastday | PSV | Dutch | Effenaar | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 218 | 2022-09-01 | Thursday | Summer | 32 | 9 | 2022 | 26.1 | 0.0 | 0.0 | 0.0 | 1.0 | 3.6 | 10.0 | None | None | False | False | True |
| 219 | 2022-09-02 | Friday | Summer | 94 | 9 | 2022 | 27.7 | 0.0 | 0.0 | 0.0 | 1.0 | 3.4 | 12.0 | None | None | False | False | False |
| 220 | 2022-09-03 | Saturday | Summer | 29 | 9 | 2022 | 28.4 | 0.0 | 0.0 | 0.0 | 1.0 | 2.3 | 1.0 | None | None | False | False | False |
| 221 | 2022-09-04 | Sunday | Summer | 33 | 9 | 2022 | 28.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.8 | 14.0 | None | None | False | False | False |
| 222 | 2022-09-05 | Monday | Summer | 14 | 9 | 2022 | 32.0 | 0.7 | 1.5 | 0.4 | 24.0 | 2.8 | 19.0 | None | None | False | False | False |
I've already discussed shrinking the dataframe to only use data after September 2022 in that analysis. I'll try both versions to see whether it makes a real difference.
💡 Feature selection¶
I'll pick the same features as for the previous model.
Day, Effenaar, PSV, Feastday, and Year.
# Define features and target variables
features = ['Day', 'Effenaar', 'PSV', 'Feastday', 'Year']
target = "Total guests"
X = df[features]
y = df[target]
🆔 Encoding¶
Year doesn't actually need encoding (it can be used as a numerical value), so I only have to encode Day and Feastday.
X = pd.get_dummies(X, columns=['Day', 'Feastday'])
X.head()
| Effenaar | PSV | Year | Day_Monday | Day_Tuesday | Day_Wednesday | Day_Thursday | Day_Friday | Day_Saturday | Day_Sunday | ... | Feastday_Hemelvaart | Feastday_Koningsdag | Feastday_Marathon | Feastday_Moederdag | Feastday_None | Feastday_Pasen | Feastday_Pinksteren | Feastday_Sinterklaas | Feastday_Vaderdag | Feastday_Valentijnsdag | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 218 | True | False | 2022 | False | False | False | True | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 219 | False | False | 2022 | False | False | False | False | True | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 220 | False | False | 2022 | False | False | False | False | False | True | False | ... | False | False | False | False | True | False | False | False | False | False |
| 221 | False | False | 2022 | False | False | False | False | False | False | True | ... | False | False | False | False | True | False | False | False | False | False |
| 222 | False | False | 2022 | True | False | False | False | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
5 rows × 25 columns
🪓 Splitting into train/test¶
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 546 observations, of which 436 are now in the train set, and 110 in the test set.
🧬 Modelling¶
Again I'm going to use an SVR, but this time I'll tune its hyperparameters. First the kernel parameter, which determines how similarity between inputs is computed. The default is RBF (Radial Basis Function), which the previous model used; it is a popular choice for non-linear problems. A linear kernel is a simple dot product between two input vectors, while a non-linear kernel is a more complex function that can capture more intricate patterns in the data (here). Since the relationship between the input features and the target appears approximately linear (e.g. the number of guests increases when there is a concert in the Effenaar), and the features are mostly boolean (true or false), choosing the linear kernel should give a better result.
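The two kernels can be compared directly on a pair of toy vectors (hypothetical values, purely for illustration), using sklearn's pairwise-kernel helpers:

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

# Two made-up feature vectors
x = np.array([[1.0, 0.0, 1.0]])
z = np.array([[1.0, 1.0, 0.0]])

# Linear kernel: a plain dot product x·z = 1*1 + 0*1 + 1*0
print(linear_kernel(x, z))           # [[1.]]

# RBF kernel: exp(-gamma * ||x - z||^2); here ||x - z||^2 = 2
print(rbf_kernel(x, z, gamma=0.5))   # [[0.36787944]] = exp(-1)
```

With mostly 0/1 dummy features, the linear kernel simply counts shared active categories, which matches the intuition that each event adds (or removes) a roughly fixed number of guests.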
model = SVR(kernel='linear')
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Best R² score:", score)
Best R² score: 0.5759751033233849
The model is performing better already! The next parameter is C, the regularization parameter. It determines how heavily training errors are penalized during fitting. A higher C narrows the margin and tolerates fewer training errors, which can raise accuracy on the training data, but this added flexibility may cause overfitting if the data contains noise or outliers, hurting predictions on new data. A small C value can lead to underfitting, with poor performance on both the training and test sets. One way to check for overfitting (and underfitting) is cross-validation. I will combine grid search with cross-validation to find a good C value.
param_grid = {"C": [0.5, 1, 2, 3, 4, 5, 10, 20, 50]}
grid_search = GridSearchCV(model, param_grid, cv=4)
grid_search.fit(X_train, y_train)
C = grid_search.best_params_["C"]
model = grid_search.best_estimator_
# Re-evaluate on the test set so the printed score reflects the tuned model
score = model.score(X_test, y_test)
print(f"Optimal C value for linear kernel SVR: {C}")
print(f"Score of the best model: {score}")
results = pd.DataFrame(grid_search.cv_results_)
results = results.sort_values(by=['rank_test_score'])
results.head()
Optimal C value for linear kernel SVR: 5 Score of the best model: 0.5759751033233849
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | mean_test_score | std_test_score | rank_test_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0.011493 | 0.001144 | 0.002997 | 0.000036 | 5 | {'C': 5} | 0.581072 | 0.588014 | 0.542982 | 0.439555 | 0.537906 | 0.059314 | 1 |
| 6 | 0.014492 | 0.002181 | 0.004759 | 0.001469 | 10 | {'C': 10} | 0.598340 | 0.592122 | 0.540482 | 0.417495 | 0.537110 | 0.072620 | 2 |
| 4 | 0.009996 | 0.001412 | 0.004513 | 0.001513 | 4 | {'C': 4} | 0.576920 | 0.584398 | 0.544799 | 0.439692 | 0.536452 | 0.057811 | 3 |
| 3 | 0.009004 | 0.001235 | 0.005751 | 0.001795 | 3 | {'C': 3} | 0.574521 | 0.581678 | 0.546181 | 0.439331 | 0.535427 | 0.057047 | 4 |
| 7 | 0.022498 | 0.002689 | 0.004499 | 0.000497 | 20 | {'C': 20} | 0.600812 | 0.596635 | 0.535397 | 0.400078 | 0.533230 | 0.081120 | 5 |
According to the grid search, the highest possible cross-validation score is a 0.54, with a C value of 5. However, the model itself is scoring a 0.58. This suggests that the model is performing better on the training data than on unseen data during cross-validation. To understand this better, it's possible to do further evaluation.
🔬 Evaluation¶
y_pred = model.predict(X_test)
me = max_error(y_test, y_pred)
me = math.ceil(me)
print("Max Error:", me)
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared Error:", rmse)
Max Error: 45 Root Mean Squared Error: 15
The RMSE is now 15 guests, with a maximum error of 45. By visualizing how the actual and predicted values line up, we can see how well the model performs across different ranges of the data.
# Create scatter plot
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linestyle='--')
plt.title('Actual vs. Predicted')
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.show()
# Optional: Residual plot
residuals = y_test - y_pred
plt.figure(figsize=(8, 6))
plt.scatter(y_pred, residuals, color='orange', alpha=0.5)
plt.axhline(y=0, color='red', linestyle='--')
plt.title('Residual Plot')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.show()
There is no clear pattern in the residual plot; the errors look randomly distributed, which is a sign of a well-fitted model. The model still performs poorly at times: there are a few outliers with an absolute error above 40. How can we evaluate further? Let's have a look at learning curves. Learning curves show how well a model does as it learns from more data. They're useful for judging how an algorithm improves as it is given more training examples: we score the model on both the training and validation data, then plot those scores against the training set size.
- Training curve: The curve calculated from the training data; used to inform how well a model is learning.
- Validation curve: The curve calculated from the validation data; used to inform of how well the model is generalizing to unseen instances.
train_sizes, train_scores, test_scores = learning_curve(
estimator=model,
X=X,
y=y,
cv=4,
scoring="neg_root_mean_squared_error",
train_sizes = [1, 75, 150, 270, 330]
)
train_mean = -train_scores.mean(axis=1)
test_mean = -test_scores.mean(axis=1)
plt.subplots(figsize=(10,8))
plt.plot(train_sizes, train_mean, label="Train")
plt.plot(train_sizes, test_mean, label="Validation")
plt.title("Learning Curve")
plt.xlabel("Training Set Size")
plt.ylabel("RMSE")
plt.legend(loc="best")
plt.show()
Looking at the graph, it has a healthy shape: the training error increases while the validation error decreases. As the number of training samples grows, the training error tends to rise because the model can no longer fit every example perfectly; conversely, the validation error drops as the model generalizes better (learns the pattern). The generalization gap between the curves is not excessively wide, which would suggest overfitting, nor are both curves stuck at a high error, which would indicate underfitting. Both the training and validation errors are relatively low, and the gap at the end is relatively small, indicating a well-fitted model. The curves appear to stabilize around an RMSE of 16-17, so adding more training data would probably not improve the model much.
🗳️ Conclusion.¶
This model shows the best score so far. I ran it multiple times including the data from before September 2022, and it performed slightly worse: the generalization gap looked wider and the error a bit higher, so using only data after September 2022 was a good choice. The model appears well-fitted, making it reliable. To improve the accuracy you could consider:
- Try different estimators.
- Introduce more features.
For the next iteration I can look at new estimators (perhaps a random forest or a decision tree learning).
🛠️ Preprocessing (Iteration 4)¶
I didn't investigate the data any deeper, so I'll perform this iteration with the same knowledge as before.
💡 Feature selection¶
I'll pick the same features as for the previous model.
Day, Effenaar, PSV, Feastday, and Year.
# Define features and target variables
features = ['Day', 'Effenaar', 'PSV', 'Feastday', 'Year']
target = "Total guests"
X = df[features]
y = df[target]
🆔 Encoding¶
Random forests don't strictly require encoding, but I still have to handle the categorical features, since the model can't read string types. So I'll use one-hot encoding once again for Day and Feastday.
X = pd.get_dummies(X, columns=['Day', 'Feastday'])
X.head()
| Effenaar | PSV | Year | Day_Monday | Day_Tuesday | Day_Wednesday | Day_Thursday | Day_Friday | Day_Saturday | Day_Sunday | ... | Feastday_Hemelvaart | Feastday_Koningsdag | Feastday_Marathon | Feastday_Moederdag | Feastday_None | Feastday_Pasen | Feastday_Pinksteren | Feastday_Sinterklaas | Feastday_Vaderdag | Feastday_Valentijnsdag | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 218 | True | False | 2022 | False | False | False | True | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 219 | False | False | 2022 | False | False | False | False | True | False | False | ... | False | False | False | False | True | False | False | False | False | False |
| 220 | False | False | 2022 | False | False | False | False | False | True | False | ... | False | False | False | False | True | False | False | False | False | False |
| 221 | False | False | 2022 | False | False | False | False | False | False | True | ... | False | False | False | False | True | False | False | False | False | False |
| 222 | False | False | 2022 | True | False | False | False | False | False | False | ... | False | False | False | False | True | False | False | False | False | False |
5 rows × 25 columns
🪓 Splitting into train/test¶
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 546 observations, of which 436 are now in the train set, and 110 in the test set.
⚖️ Scaling¶
Random forests are based on decision trees, which are not sensitive to feature scaling: neither relies on distances between data points, so scaling is unnecessary and I'll skip this step.
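This scale-invariance is easy to verify on synthetic data (random values, unrelated to the restaurant dataset): a decision tree fitted on standardized features makes the same predictions as one fitted on the raw features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 3)                   # synthetic features
y_demo = 10 * X_demo[:, 0] + rng.rand(200)  # synthetic target

scaler = StandardScaler().fit(X_demo)

tree_raw = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)
tree_std = DecisionTreeRegressor(random_state=0).fit(scaler.transform(X_demo), y_demo)

# The split thresholds differ, but the resulting predictions are identical
print(np.allclose(tree_raw.predict(X_demo),
                  tree_std.predict(scaler.transform(X_demo))))  # True
```

Scaling is a monotone transformation, and tree splits only depend on the ordering of feature values, so the tree partitions the data identically either way.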
🧬 Modelling¶
This time I will use a random forest regressor, which consists of multiple decision trees. A decision tree keeps splitting the data on feature thresholds until it reaches a leaf node. A random forest is an ensemble technique: it trains many decision trees on random subsets of the data and averages their predictions, which reduces overfitting and handles noisy data well.
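The averaging described above can be checked directly: a fitted RandomForestRegressor's prediction is (up to floating point) the mean of its individual trees' predictions. A small sketch on synthetic data, independent of the restaurant dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X_demo = rng.rand(100, 4)                    # synthetic features
y_demo = X_demo.sum(axis=1) + rng.rand(100)  # synthetic target

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Average the predictions of the individual trees by hand...
by_hand = np.mean([tree.predict(X_demo) for tree in forest.estimators_], axis=0)

# ...and compare with the forest's own prediction
print(np.allclose(by_hand, forest.predict(X_demo)))  # True
```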
model = RandomForestRegressor()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Best R² score:", score)
# Apply 4-fold cross validation (avoid shadowing the cross_val_score function)
cv_scores = cross_val_score(model, X_train, y_train, cv=4)
print("Cross validation score:", cv_scores)
print("Mean cross validation score:", cv_scores.mean())
Best R² score: 0.5070410718996687 Cross validation score: [0.61581706 0.43192111 0.50353207 0.38808727] Mean cross validation score: 0.48483937645656244
A decent score, and the model's test score is consistent with the mean cross-validation score. Let's check the feature importance.
# Plot feature importance
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=90)
plt.title("Feature Importance")
plt.show()
Many features have very little impact on the model; several feastdays and days barely contribute. So I'll build the exact same regressor, but combine some features first.
Starting with Feastday, categorize the feastdays into three groups: high_feastdays, with a correlation above 0.1; low_feastdays, below that; and of course None, which isn't a feastday at all.
# Create a list of high correlation feastdays
high_feastdays = ['Guus Meeuwis', 'Glow', 'Dutch Design Week', 'Marathon']
# Create a list of low correlation feastdays
low_feastdays = ['Valentijnsdag', 'Carnval', 'Pasen', 'Koningsdag', 'Bevrijdingsdag', 'Moederdag', 'Hemelvaart', 'Pinksteren', 'Vaderdag', 'Sinterklaas']
# Map the feastdays to the categories ('None' stays as-is)
df.loc[df['Feastday'].isin(high_feastdays), 'Feastday'] = 'High'
df.loc[df['Feastday'].isin(low_feastdays), 'Feastday'] = 'Low'
Then for Day: combine Monday through Thursday into a single 'Weekday' category, while keeping Friday, Saturday, and Sunday as separate categories.
# Convert 'Day' column to categorical with specified categories
df['Day'] = pd.Categorical(df['Day'], categories=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'])
# Map days to 'Weekday' or 'Weekend'
df['Day'] = df['Day'].map(lambda x: 'Weekday' if x in ['Monday', 'Tuesday', 'Wednesday', 'Thursday'] else x)
df
| Date | Day | Season | Total guests | Month | Year | Temperature | Rain | Duration rain | Max rain | Timezone max rain | Wind | Timezone max wind | Holiday | Feastday | PSV | Dutch | Effenaar | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 218 | 2022-09-01 | Weekday | Summer | 32 | 9 | 2022 | 26.1 | 0.0 | 0.0 | 0.0 | 1.0 | 3.6 | 10.0 | None | None | False | False | True |
| 219 | 2022-09-02 | Friday | Summer | 94 | 9 | 2022 | 27.7 | 0.0 | 0.0 | 0.0 | 1.0 | 3.4 | 12.0 | None | None | False | False | False |
| 220 | 2022-09-03 | Saturday | Summer | 29 | 9 | 2022 | 28.4 | 0.0 | 0.0 | 0.0 | 1.0 | 2.3 | 1.0 | None | None | False | False | False |
| 221 | 2022-09-04 | Sunday | Summer | 33 | 9 | 2022 | 28.5 | 0.0 | 0.0 | 0.0 | 1.0 | 1.8 | 14.0 | None | None | False | False | False |
| 222 | 2022-09-05 | Weekday | Summer | 14 | 9 | 2022 | 32.0 | 0.7 | 1.5 | 0.4 | 24.0 | 2.8 | 19.0 | None | None | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 759 | 2024-03-11 | Weekday | Winter | 45 | 3 | 2024 | 8.4 | 21.0 | 18.8 | 3.3 | 13.0 | 2.0 | 10.0 | None | None | False | False | False |
| 760 | 2024-03-12 | Weekday | Winter | 20 | 3 | 2024 | 10.8 | 4.5 | 9.8 | 0.8 | 19.0 | 3.8 | 11.0 | None | None | False | False | False |
| 761 | 2024-03-13 | Weekday | Winter | 11 | 3 | 2024 | 13.2 | 1.6 | 3.7 | 0.5 | 6.0 | 4.2 | 15.0 | None | None | False | False | False |
| 762 | 2024-03-14 | Weekday | Winter | 55 | 3 | 2024 | 18.1 | 0.0 | 0.0 | 0.0 | 1.0 | 3.5 | 11.0 | None | None | False | False | True |
| 763 | 2024-03-15 | Friday | Winter | 53 | 3 | 2024 | 15.8 | 1.8 | 1.0 | 0.7 | 17.0 | 5.1 | 13.0 | None | None | False | False | False |
546 rows × 18 columns
I'll retrain the exact same model to check whether this improves it.
from sklearn.model_selection import cross_val_score
# Define features and target variables
features = ['Day', 'Effenaar', 'PSV', 'Feastday', 'Year']
target = "Total guests"
X = df[features]
y = df[target]
X = pd.get_dummies(X, columns=['Day', 'Feastday'])
#drop feastday low
X = X.drop(columns=['Feastday_Low'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = RandomForestRegressor()
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print("Best R² score:", score)
# Apply cross validation (avoid shadowing the cross_val_score function)
cv_scores = cross_val_score(model, X_train, y_train, cv=4)
print("Cross validation score:", cv_scores)
print("Mean cross validation score:", cv_scores.mean())
# Plot feature importance
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
plt.figure(figsize=(10, 6))
plt.bar(range(X.shape[1]), importances[indices])
plt.xticks(range(X.shape[1]), X.columns[indices], rotation=90)
plt.title("Feature Importance")
plt.show()
Best R² score: 0.5040172930599268 Cross validation score: [0.53339567 0.55256865 0.51786315 0.44747321] Mean cross validation score: 0.5128251719780703
The model's performance shows a slight improvement, along with better cross-validation results. Notably, the variance between folds has decreased, suggesting that the Feastday and Day adjustments had a positive effect. By consolidating low-impact categories into a single group, the model focuses on more general patterns, which improves performance on unseen data.
The untuned random forest regressor scores 0.50 on the test set, with a mean cross-validation score of 0.51, which suggests it isn't overfitted. A random forest regressor has several tunable parameters:

- n_estimators: the number of decision trees being used.
- max_depth: the maximum depth of each tree. A decision tree splits (based on features), those nodes split again, and so on (until there are no samples left).
- min_samples_split: the minimum number of samples required to split an internal node.
- min_samples_leaf: the minimum number of samples required at a leaf node.

There are more parameters, but I will tune these four. I'm going to apply grid search again to find the "best" parameters (those that result in the highest accuracy), to see if the model can perform any better.
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
model = RandomForestRegressor()
# Exhaustively search the grid with 4-fold cross-validation
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=4)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
model = grid_search.best_estimator_
score = model.score(X_test, y_test)
print(f"Score of the best model: {score}")
# Show the best-ranked parameter combinations
results = pd.DataFrame(grid_search.cv_results_)
results = results.sort_values(by=['rank_test_score'])
results.head()
Score of the best model: 0.5070699760418282
| mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_min_samples_leaf | param_min_samples_split | param_n_estimators | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | mean_test_score | std_test_score | rank_test_score | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 96 | 0.184252 | 0.006964 | 0.010759 | 0.001937 | 30 | 2 | 10 | 100 | {'max_depth': 30, 'min_samples_leaf': 2, 'min_... | 0.598884 | 0.590307 | 0.537782 | 0.447621 | 0.543648 | 0.060174 | 1 |
| 6 | 0.180761 | 0.049817 | 0.011760 | 0.004963 | None | 1 | 10 | 100 | {'max_depth': None, 'min_samples_leaf': 1, 'mi... | 0.596858 | 0.588801 | 0.540467 | 0.444945 | 0.542768 | 0.060455 | 2 |
| 8 | 0.494753 | 0.040719 | 0.024248 | 0.007393 | None | 1 | 10 | 300 | {'max_depth': None, 'min_samples_leaf': 1, 'mi... | 0.591509 | 0.591006 | 0.536667 | 0.449461 | 0.542161 | 0.057975 | 3 |
| 52 | 0.435297 | 0.145741 | 0.080745 | 0.078590 | 10 | 4 | 10 | 200 | {'max_depth': 10, 'min_samples_leaf': 4, 'min_... | 0.600162 | 0.579570 | 0.554635 | 0.434155 | 0.542130 | 0.064390 | 4 |
| 102 | 0.123488 | 0.001478 | 0.008259 | 0.001785 | 30 | 4 | 5 | 100 | {'max_depth': 30, 'min_samples_leaf': 4, 'min_... | 0.603258 | 0.571217 | 0.557620 | 0.435055 | 0.541788 | 0.063810 | 5 |
The tuned model's test score stays at roughly 0.51, while the best mean cross-validation score rises to about 0.54, which seems a bit odd at first. However, it's essential to consider the randomness inherent in both the data splits and the model's behaviour during each fold of cross-validation. In this case, the dataset's small size amplifies these random variations, leading to notable differences between the folds. Although the test score barely changed, the improved cross-validation performance suggests the model might generalize better to unseen data, indicating a potential reduction in overfitting, which is a positive outcome. Reasons to choose this model:
- While its test score is essentially unchanged, it demonstrated improved performance during cross-validation, suggesting potentially better generalization to unseen data.
- It may be more robust and less prone to overfitting, making it a safer choice for deployment in real-world scenarios where the data distribution may vary. This model will probably be more reliable in real-life use than the other model, so I will evaluate this one.
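The comparison above can be sketched in code: score both the untuned forest and the grid search's top-ranked configuration on identical folds and compare the mean and spread. This is a minimal illustration on synthetic stand-in data so the cell runs on its own; in the notebook, X_train and y_train from the cells above would be used instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (the real notebook would use X_train / y_train)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 5))
y_demo = 60 + 10 * X_demo[:, 0] + rng.normal(size=200)

# The tuned parameters are the ones grid search ranked first above
untuned = RandomForestRegressor(random_state=1)
tuned = RandomForestRegressor(max_depth=30, min_samples_leaf=2,
                              min_samples_split=10, n_estimators=100,
                              random_state=1)

results = {}
for name, est in [("untuned", untuned), ("tuned", tuned)]:
    scores = cross_val_score(est, X_demo, y_demo, cv=4)
    # A higher mean with a lower std suggests more stable generalization
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

With the real (small) dataset, the fold-to-fold spread is exactly the randomness discussed above, so the standard deviation is worth reporting alongside the mean.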
🔬 Evaluation¶
import math
from sklearn.metrics import max_error, mean_squared_error

y_pred = model.predict(X_test)
# Largest single prediction error, rounded up to a whole guest
me = math.ceil(max_error(y_test, y_pred))
print("Max Error:", me)
# Root mean squared error, rounded up to a whole guest
mse = mean_squared_error(y_test, y_pred)
rmse = math.ceil(math.sqrt(mse))
print("Root Mean Squared Error:", rmse)
Max Error: 54 Root Mean Squared Error: 16
The root mean squared error has increased slightly to 16 guests.
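Note that RMSE is not quite the "average error": squaring penalizes large misses more heavily, so RMSE sits above the mean absolute error (MAE) whenever errors vary in size. A tiny sketch with made-up predictions (stand-ins for y_test / y_pred above) shows the difference:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical guest counts and predictions, for illustration only
y_true = np.array([40, 55, 70, 30])
y_hat = np.array([50, 50, 60, 34])

# MAE: the plain average miss, in guests
mae = mean_absolute_error(y_true, y_hat)
# RMSE: square, average, then take the root
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
print("MAE:", mae)              # (10 + 5 + 10 + 4) / 4 = 7.25
print("RMSE:", round(rmse, 2))  # sqrt((100 + 25 + 100 + 16) / 4) ≈ 7.76
```

Reporting MAE alongside RMSE would give a more literal "average error in guests" for the restaurant staff.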
🗳️ Conclusion¶
The (tuned) random forest regressor performs fairly well. It scores slightly worse than the support vector machine, but overall the model does not appear to be overfitted, and it is impressive how well it performs given the few features it uses.
Phase 4 Delivery¶
Demonstration¶
First of all, let's check whether a simple input works. The model needs a date, an Effenaar flag, a PSV flag and a feast day input. For now we'll enter the date, Effenaar and PSV manually; for the feast day there should eventually be an option to select which feast day the date falls on (or an automatic lookup), but for now we'll do that manually as well.
from datetime import datetime

# Create a dictionary to hold input values
input_values = {}
input_date_str = '2024-05-17'
input_date = datetime.strptime(input_date_str, '%Y-%m-%d')
day_of_week = input_date.strftime('%A')
year = input_date.year
# Initialise all boolean features to False
for feature in X.columns:
    if X[feature].dtype == 'bool':
        input_values[feature] = False
# Set the remaining features (Year, Effenaar, Feastday_None) manually
input_values['Year'] = year
input_values['Effenaar'] = True
input_values['Feastday_None'] = True
# Derive the one-hot Day_* column from the date
if day_of_week in ['Monday', 'Tuesday', 'Wednesday', 'Thursday']:
    input_values['Day_Weekday'] = True
elif day_of_week == 'Friday':
    input_values['Day_Friday'] = True
elif day_of_week == 'Saturday':
    input_values['Day_Saturday'] = True
else:
    input_values['Day_Sunday'] = True
# Create a DataFrame with the input values
input_df = pd.DataFrame([input_values])
# Reorder columns in input_df to match the order of columns in X
input_df = input_df[X.columns]
print(input_df.head().to_string())
# Predict the number of guests for the input day using the trained model
predicted_guests = model.predict(input_df)
print("Predicted number of guests:", predicted_guests)
Effenaar PSV Year Day_Friday Day_Saturday Day_Sunday Day_Weekday Feastday_High Feastday_None 0 True False 2024 True False False False False True Predicted number of guests: [77.81044923]
This is an outline of how it should look. Next I want to improve it so that only a date needs to be entered, with the other options selectable in a separate window.
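One step in that direction is to wrap the manual cell above in a helper that derives the one-hot Day_* column from the date automatically. This is a hypothetical sketch: the `build_input` name is mine, and the hard-coded column list mirrors the columns shown in the demonstration output above so the cell runs on its own (the notebook would use X.columns instead).

```python
from datetime import datetime

import pandas as pd

# Columns as they appear in the demonstration output above
COLUMNS = ['Effenaar', 'PSV', 'Year', 'Day_Friday', 'Day_Saturday',
           'Day_Sunday', 'Day_Weekday', 'Feastday_High', 'Feastday_None']

def build_input(date_str, effenaar=False, psv=False, feastday='None'):
    """Build a single-row model input from a date plus a few flags."""
    date = datetime.strptime(date_str, '%Y-%m-%d')
    row = {col: False for col in COLUMNS}
    row['Year'] = date.year
    row['Effenaar'] = effenaar
    row['PSV'] = psv
    # Feast day categories follow the encoding used earlier ('High', 'None')
    row[f'Feastday_{feastday}'] = True
    day = date.strftime('%A')
    # Monday-Thursday were grouped as 'Weekday' during preprocessing
    suffix = day if day in ('Friday', 'Saturday', 'Sunday') else 'Weekday'
    row['Day_' + suffix] = True
    return pd.DataFrame([row])[COLUMNS]

input_df = build_input('2024-05-17', effenaar=True)
print(input_df.to_string(index=False))
# The prediction would then be: model.predict(input_df)
```

From here, a small date-picker UI would only need to call this one function.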
Feedback¶
Both the Support Vector Machine (SVM) and Random Forest Regressor performed well. While the SVM achieved a slightly higher test score, the Random Forest had a better cross-validation score, making it more reliable for real-world applications. Therefore, I recommend using the Random Forest model. It has a root mean squared error of 16 and a maximum error of 54. Considering our restaurant's indoor (ground-floor) capacity of 42 guests, a typical error of 16 is quite impressive, as it is less than half of our seating capacity. This level of accuracy can help us make staffing decisions, such as whether to add or remove an extra employee for the evening.
Predicting the number of guests accurately is challenging because attendance fluctuates strongly from day to day, but this model comes very close. It uses only a few features (Day, PSV, Effenaar, Feastday, Year) and predicts guest numbers effectively, making it simple and reliable.
I discussed these results with my boss, who was impressed. He would definitely consider using the model if it is user-friendly. However, he might not fully rely on it for decision-making because of his extensive experience in the restaurant industry. He already has a good sense of the factors that influence guest numbers, such as certain days and events like PSV or Effenaar, and he is often accurate in his predictions.
To enhance this work, consider exploring ensemble learning by combining multiple models to improve accuracy and robustness. Additionally, enlarging the dataset could help increase the model's reliability. Finding more features that strongly correlate with guest numbers could further improve the model, although identifying such features might be challenging given the current dataset.
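The ensemble idea mentioned above could be tried with scikit-learn's VotingRegressor, which averages the predictions of the SVM and the random forest. This is only a sketch on synthetic stand-in data so the cell runs on its own; in the notebook, X_train and y_train from the modelling cells would be used instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (the real notebook would use X_train / y_train)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = 60 + 15 * X_demo[:, 0] + rng.normal(size=150)

# Average the two models' predictions; the SVM gets the feature scaling
# it is sensitive to via a pipeline
ensemble = VotingRegressor([
    ('svm', make_pipeline(StandardScaler(), SVR())),
    ('rf', RandomForestRegressor(random_state=1)),
])
scores = cross_val_score(ensemble, X_demo, y_demo, cv=4)
print("Ensemble CV mean:", round(scores.mean(), 3))
```

Whether the average beats the random forest alone would need to be checked on the real data; ensembles help most when the individual models make different kinds of errors.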